Nesterovsky bros - Thinking aloud

Experience of parallel refactoring

Vladimir Nesterovsky — Fri, 26 Jan 2024 13:27:16 GMT

Introduction

We migrate code from the mainframe (MF) to Azure. The tool we use produces plain, good, functionally equivalent C# code. But it turns out that's not enough!

So, what's the problem?

The converted code is very slow, especially for batch processing: where the MF completes a job in, say, 30 minutes, the converted code takes 8 hours.

At this point usually someone appears and whispers in the ear:

Look, those old technologies are proven by time. It's worth sticking to good old Cobol, or better yet Assembler, if you want to do the real thing.

We're curious though: why is there a difference?

It turns out the issue lies in the difference in network topology between the MF and Azure solutions. On the MF all programs, the database and file storage virtually sit in a single box, so network latency is negligible.

It's rather usual to see chatty SQL programs on the MF that issue a lot of small SQL queries.

In Azure, programs, the database and file storage are different services, most certainly sitting in different physical boxes. You should be thankful if they are co-located in a single datacenter. So network latency immediately becomes a factor. Even if it adds just 1 millisecond per SQL roundtrip, it adds up in loops and turns into a showstopper.

There is no easy workaround on the hardware level.

People advise writing programs differently: "Tune applications and databases for performance in Azure SQL Database".

That's good advice for new development, but discouraging for a migration done by a tool.

So, what is the way forward?

Well, there is one. While accepting the weak sides of Azure, we can exploit its strong sides.

Parallel refactoring

Before continuing, let's consider code that demonstrates the problem:

public void CreateReport(StringWriter writer)
{
  var index = 0;

  foreach(var transaction in dataService.
    GetTransactions().
    OrderBy(item => (item.At, item.SourceAccountId)))
  {
    var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
    var targetAccount = transaction.TargetAccountId != null ?
      dataService.GetAccount(transaction.TargetAccountId) : null;

    ++index;

    if (index % 100 == 0)
    {
      Console.WriteLine(index);
    }

    writer.WriteLine($"{index},{transaction.Id},{transaction.At},{transaction.Type},{transaction.Amount},{transaction.SourceAccountId},{sourceAccount?.Name},{transaction.TargetAccountId},{targetAccount?.Name}");
  }
}

This loop queries transactions and, for each transaction, issues two more small queries to get the source and target accounts. The results are printed into a report.

If we assume a query latency of just 1 millisecond and run such code for 100K transactions, the two account lookups per transaction alone easily add up to 200+ seconds of execution.
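The estimate above is plain multiplication, and can be sketched as follows. The class and method names are hypothetical, and the model assumes that every query waits for the previous one and that nothing but latency is counted.

```csharp
using System;

public static class LatencyEstimate
{
  // Time spent waiting when each query starts only after the previous one returns.
  public static TimeSpan Serial(
    int transactions,
    int queriesPerTransaction,
    TimeSpan latency) =>
    TimeSpan.FromTicks(latency.Ticks * transactions * queriesPerTransaction);

  public static void Main()
  {
    var latency = TimeSpan.FromMilliseconds(1);

    // 100K transactions, two account lookups each, 1 ms per roundtrip.
    Console.WriteLine(Serial(100_000, 2, latency).TotalSeconds); // 200
  }
}
```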

Reality turns out to be much worse. The program spends most of its lifetime waiting for database results, and an iteration doesn't start until all work of the previous iterations is complete.

We could do better even without trying to rewrite all code! Let's articulate our goals:

To make code fast.
To leave code recognizable.

The idea is to form two processing pipelines:

(a) one that processes data in parallel out of order;
(b) another that processes data serially, in the original order;

Each pipeline may post sub-tasks to the other, so (a) runs its tasks in parallel unordered, while (b) runs its tasks as if everything was running serially.

So, the parallel plan looks like this:

Queue parallel sub-tasks (a) for each transaction.
Parallel sub-task in (a) reads source and target accounts, and queues serial sub-task (b) passing transaction and accounts.
Serial sub-task (b) increments index, and writes report record.
Wait for all tasks to complete.

To reduce the burden of managing task pipelines we use Dataflow (Task Parallel Library), and encapsulate everything in a small wrapper.

Consider the refactored code:

public void CreateReport(StringWriter writer)
{
  using var parallel = new Parallel(options.Value.Parallelism); // ⬅️ 1
  var index = 0;

  parallel.ForEachAsync( // ⬅️ 2
    dataService.
      GetTransactions().
      OrderBy(item => (item.At, item.SourceAccountId)),
    transaction => // ⬅️ 3
    {
      var sourceAccount = dataService.GetAccount(transaction.SourceAccountId);
      var targetAccount = transaction.TargetAccountId != null ?
        dataService.GetAccount(transaction.TargetAccountId) : null;

      parallel.PostSync( // ⬅️ 4
        (transaction, sourceAccount, targetAccount), // ⬅️ 5
        data =>
        {
          var (transaction, sourceAccount, targetAccount) = data; // ⬅️ 6

          ++index;

          if (index % 100 == 0)
          {
            Console.WriteLine(index);
          }

          writer.WriteLine($"{index},{transaction.Id},{transaction.At},{transaction.Type},{transaction.Amount},{transaction.SourceAccountId},{sourceAccount?.Name},{transaction.TargetAccountId},{targetAccount?.Name}");
        });
    });
}

Consider the ⬅️ points:

1. We create the Parallel utility class, passing the requested degree of parallelism.
2. We iterate transactions using parallel.ForEachAsync(), which queues a parallel sub-task for each transaction and then waits until all tasks are complete.
3. Each parallel sub-task receives a transaction. It may be called from a different thread.
4. Having received the required accounts, we queue a sub-task for synchronous execution using parallel.PostSync(), and
5. pass it the data collected in the parallel sub-task: the transaction and accounts.
6. We deconstruct the passed data into variables, and then proceed with the serial logic.

What we achieve with this refactoring:

The top-level query that brings transactions is executed and iterated serially.
But each iteration body runs in parallel. By default we allow up to 100 parallel executions. These parallel sub-tasks do not wait on each other, so their waiting times do not add up.
Sync sub-tasks are queued and executed in the order of their serial appearance, so increments and report records are subject neither to race conditions nor to reordering of output records.

We think that such refactored code is still recognizable.

As for performance, this is what the log shows:

Serial test 100
...
Execution time: 00:01:33.8152540

Parallel test 100
...
Execution time: 00:00:05.8705468
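The two-pipeline idea (parallel lookups, then serial processing in the original order) can be illustrated with stock TPL Dataflow blocks. This is only a minimal sketch and not the Parallel class from the project: SlowLookup is a hypothetical stand-in for dataService.GetAccount(), and the ordering relies on the EnsureOrdered option of TransformBlock from System.Threading.Tasks.Dataflow.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks.Dataflow;

public static class OrderedPipelineSketch
{
  // Hypothetical stand-in for a slow call such as dataService.GetAccount().
  static string SlowLookup(int id)
  {
    Thread.Sleep(10); // simulated network roundtrip
    return $"account-{id}";
  }

  public static List<string> Run(IEnumerable<int> ids, int parallelism)
  {
    var report = new List<string>();
    var index = 0;

    // (a) Parallel stage: lookups overlap, results leave in input order.
    var lookup = new TransformBlock<int, (int id, string name)>(
      id => (id, SlowLookup(id)),
      new ExecutionDataflowBlockOptions
      {
        MaxDegreeOfParallelism = parallelism,
        EnsureOrdered = true
      });

    // (b) Serial stage: an ActionBlock handles one item at a time by default.
    var write = new ActionBlock<(int id, string name)>(
      item => report.Add($"{++index},{item.id},{item.name}"));

    lookup.LinkTo(write, new DataflowLinkOptions { PropagateCompletion = true });

    foreach (var id in ids)
    {
      lookup.Post(id);
    }

    lookup.Complete();
    write.Completion.Wait();

    return report;
  }

  public static void Main()
  {
    foreach (var line in Run(Enumerable.Range(1, 5), 4))
    {
      Console.WriteLine(line);
    }
  }
}
```

The serial stage never sees records out of order, so the index and the report stay consistent without locks.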

Reference

Please take a look at the project to understand implementation details, in particular the Parallel class, which implements the API to post parallel and serial tasks, run cycles and some more.

Please continue reading on GitHub.

Using Bing chat as a tool.

Vladimir Nesterovsky — Mon, 01 Jan 2024 14:02:01 GMT

As we wrote some time ago, we started to use AI and ML APIs in our applications, but this time we've tried to use Bing chat as a tool.

The initial task was to collect sequences of integer numbers and return a list of integer pairs, where the first number is the start of a range and the second is the count of values in that range. For example, for the sequence 1, 2, 3, 5, 8, 9 we'd like to get the following results:

1, 3
5, 1
8, 2

We both set out to solve the problem. My brother started to play with SSMS, using his deep knowledge of SQL and analytic functions, whereas I started to play with Bing chat. Looking ahead, I can say that we both solved the problem with almost the same SQL statement in almost the same time.

Here are my prompts and Bing's answers:

Prompt:: Act as database analyst and software engineer. Given the table (T) with a numeric field (N), which values are orderd ranges of integers. In a range each next value is greater than previous one on 1. When value differs from the previous one more than on 1 then in belongs to another range. Generate select statement that returns rows with two fields. The first field (R) is a number from which a range starts and the second field (C) is count elements of that range:
Answer:: Sure, I can help with that. Here's a SQL query that should do the trick. This query uses a self-join to identify the start of each range, and then groups by the start of the range to count the number of elements.
```sql
SELECT start_of_range AS R, COUNT(*) AS C
FROM (
  SELECT N, N - ROW_NUMBER() OVER (ORDER BY N) AS start_of_range
  FROM T
) subquery
GROUP BY start_of_range
ORDER BY start_of_range;
```
In this query, `start_of_range` is calculated as the difference between the current value of `N` and its row number (when ordered by `N`). This difference will be the same for all `N` in the same range. Then we group by `start_of_range` to get the count of elements in each range. The outer query selects the start of each range (`R`) and the count of elements in that range (`C`).
Me, scratching the head, hmmm...
Question:: This is a bit incorrect SQL select. For example for values 1, 2, 3, 5, 8, 9 I'd like to get the following results: 1, 3 5, 1 8, 2 What a SQL select shall I use to get such results?
Answer:: I apologize for the confusion earlier. You're right, the previous query doesn't handle the scenario you've described correctly. Here's a corrected SQL query that should give you the desired output. This query uses a window function to compare each row with the previous one and identify the start of a new range. Then it groups by the start of the range to count the elements.
```sql
WITH ranges AS (
  SELECT
    N,
    CASE WHEN N - LAG(N, 1, N) OVER (ORDER BY N) > 1 THEN 1 ELSE 0 END AS is_start_of_range
  FROM T
),
range_groups AS (
  SELECT N, SUM(is_start_of_range) OVER (ORDER BY N) AS range_group
  FROM ranges
)
SELECT MIN(N) AS R, COUNT(*) AS C
FROM range_groups
GROUP BY range_group
ORDER BY R;
```
In this query, `is_start_of_range` is a flag that indicates the start of a new range (when the difference between the current and previous `N` is more than 1). `range_group` is a running total of `is_start_of_range` that serves as a unique identifier for each range. The outer query then groups by `range_group` to get the start (`R`) and count (`C`) of each range. I hope this helps! Let me know if you have any other questions.

That answer satisfied me more, so I checked it against the following example:

with T as
(
  select 10 N
  union all select 11 N
  union all select 30 N
  union all select 31 N
  union all select 32 N
  union all select 50 N
  union all select 80 N
  union all select 81 N
  union all select 82 N
  union all select 83 N
  union all select 90 N
),
ranges AS
(
  SELECT
    N,
    CASE WHEN N - LAG(N, 1, N) OVER (ORDER BY N) > 1 THEN 1 ELSE 0 END AS is_start_of_range
  FROM T
),
range_groups AS
(
  SELECT N, SUM(is_start_of_range) OVER (ORDER BY N) AS range_group
  FROM ranges
)
SELECT MIN(N) AS R, COUNT(*) AS C
FROM range_groups
GROUP BY range_group
ORDER BY R;

The result was:

R   C
------------
10  2
30  3
50  1
80  4
90  1
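For comparison, the same grouping can be traced procedurally. The sketch below is not part of the chat; the Ranges class is a hypothetical helper that assumes distinct input values, as in the SQL example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Ranges
{
  // Returns (start, count) for every run of consecutive integers.
  public static List<(int Start, int Count)> Collect(IEnumerable<int> values)
  {
    var result = new List<(int Start, int Count)>();

    foreach (var n in values.OrderBy(v => v))
    {
      var last = result.Count - 1;

      if (last >= 0 && n == result[last].Start + result[last].Count)
      {
        // n continues the current range.
        result[last] = (result[last].Start, result[last].Count + 1);
      }
      else
      {
        // A gap (or the first value): n starts a new range.
        result.Add((n, 1));
      }
    }

    return result;
  }

  public static void Main()
  {
    foreach (var (start, count) in Collect(new[] { 1, 2, 3, 5, 8, 9 }))
    {
      Console.WriteLine($"{start}, {count}");
    }
    // Prints:
    // 1, 3
    // 5, 1
    // 8, 2
  }
}
```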

Conclusion:
Nowadays, using AI as a tool, you may improve your performance to the expert level and, maybe, learn something new...

P.S. Don't fool yourself that AI can replace a developer (see the first answer of the chat), but together they can be much stronger than separately.

P.P.S. Our colleague gave us another interesting solution. She used an additional field to avoid using analytic functions, and she reached the same result:

with T as
(
  select 10 ID, 1 N
  union all select 11 ID, 1 N
  union all select 30 ID, 1 N
  union all select 31 ID, 1 N
  union all select 32 ID, 1 N
  union all select 50 ID, 1 N
  union all select 80 ID, 1 N
  union all select 81 ID, 1 N
  union all select 82 ID, 1 N
  union all select 83 ID, 1 N
  union all select 90 ID, 1 N
),
Groups AS
(
  SELECT ID, N, ROW_NUMBER() OVER (ORDER BY ID) - ID AS GroupNumber
  FROM T
)
SELECT MIN(ID) AS R, SUM(N) AS C
FROM Groups
GROUP BY GroupNumber
ORDER BY R;

Vector database

Vladimir Nesterovsky — Thu, 22 Jun 2023 22:01:45 GMT

Many years ago we implemented an Akinator-like engine purely within SQL Server.

Today we use exactly the same technique to implement a vector database.

Please see our GitHub repo: vector-database.

ChatGPT

Arthur Nesterovsky — Thu, 23 Mar 2023 11:30:03 GMT

For the last few days we have been playing with the OpenAI API and, out of pure interest, asked it for a few slogans for our team. As input we fed it the info from our "About us" page. One of the first slogans we got caught our eye:

Excellence Through Experience: Nesterovsky Bros.

ChatGPT is not a bad copywriter at all...

xml vs json

Vladimir Nesterovsky — Sat, 16 Apr 2022 19:03:04 GMT

Xslt is oftentimes thought of as a tool that takes input xml and runs a transformation to get html or some xml on output. Our use case is more complex, and is closer to batch data mining of big data. Our transformation pipelines often take an hour or more to run, even with SSD disks and with CPU cores fully loaded with work.

So, we're looking for performance opportunities, and xml vs json might be promising.

Here are our hypotheses:

json is lighter than xml to serialize and deserialize;
json stored as map(*), array(*) and other items is lighter than node() at runtime; in particular, subtree copy is zero cost in json;
templates with match patterns can be implemented efficiently with maps;
there is incremental way forward from use of xml to use of json.

If it pays off we might switch from the xml format to json all over, even though it takes development effort.

But to proceed we need to conduct an experiment to measure the processing speed of xml vs json in xslt.

Now our task is to find an isolated small representative sample to prove or reject our hypotheses.

Better to start off with some existing transformation, and change it from use of xml to json.

The question is whether there is such a candidate.

Bad code

Vladimir Nesterovsky — Sat, 15 Jan 2022 13:55:08 GMT

A couple of days ago, while integrating with someone's C# library, we had to debug it, as something went wrong. The code is big and obscure, but for integration purposes it's rather simple: you just create and call a class, that's all. Yet something just did not work. We had to prove that it was not our fault, as the other side was uncooperative and would not run a common debug session to resolve the problem.

To simplify the matter as much as possible, here is the case:

var input = ...
var x = new X();
var output = x.Execute(input);

You pass correct input, and get correct output. Simple, right? But it did not work! So, we delved into the foreign code, and this is what we have seen:

class X: Y
{
  public Output Execute(Input input)
  {
    return Perform(input);
  }

  protected override Output Run(Input input)
  { 
    ...

     return output;
  }
}

class Y: Z
{
  ...
}

class Z
{
  protected Output Perform(Input input)
  {
    return Run(input);
  }
        
  protected virtual Output Run(Input input)
  {
    return null;
  }
}

You see, the flow is still simple, right? We call X.Execute(), it calls Z.Perform(), which in turn calls the overridden X.Run() that returns the result.

But to our puzzlement we got null on output, as if Z.Run() was called!

We stepped through the code in debugger and confirmed that Z.Perform() calls Z.Run(), even though "this" instance is of type X.

How can it be? It's nonsense! Yet no overridden method was ever called.

No matter how much scrutiny we applied to the sources of X and Z, it just did not work.

We verified that the signature of X.Run() matches the signature of Z.Run(), so it overrides the method.

Then what do we see here?

And then enlightenment came! Yes, X.Run() overrides a method, but which method?

We looked closely at class Y and, bingo, this is what we saw there:

class Y: Z
{
  ...

  protected virtual Output Run(Input input)
  {
    return null;
  }
      
  ...
}

So, X.Run() overrides Y.Run() and not Z.Run()!

Per .NET, Y.Run() and Z.Run() are two independent virtual methods, where Y.Run() in addition hides Z.Run().

The IDE even issued a warning that it's better to declare Y.Run() as:

protected new virtual Output Run(Input input)
{
  return null;
}

So, someone's code was plainly wrong: Y.Run() had to use override rather than virtual.
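The effect is easy to reproduce in isolation. Below is a minimal sketch with simplified signatures (strings instead of Input and Output, and Perform() made public for the demo); the YFixed and XFixed classes are hypothetical names showing the corrected declaration.

```csharp
using System;

public class Z
{
  public string Perform() => Run();

  protected virtual string Run() => "Z.Run";
}

public class Y : Z
{
  // Declares a NEW virtual method that hides Z.Run(); the compiler warns here.
  protected virtual string Run() => "Y.Run";
}

public class X : Y
{
  // Overrides Y.Run(), which Z.Perform() knows nothing about.
  protected override string Run() => "X.Run";
}

public class YFixed : Z
{
  // Correct: continues the virtual chain started by Z.Run().
  protected override string Run() => "YFixed.Run";
}

public class XFixed : YFixed
{
  protected override string Run() => "XFixed.Run";
}

public static class HidingDemo
{
  public static void Main()
  {
    Console.WriteLine(new X().Perform());      // Z.Run
    Console.WriteLine(new XFixed().Perform()); // XFixed.Run
  }
}
```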

We won, right?

Well, it's hard to call it a win.

We spent an hour looking at someone's ugly code just to prove we're still sane.

So, what is the conclusion of this story?

We think here it is:

be cautious while looking at someone's code;
look at IDE warnings, don't disregard them, and try to resolve all of them in your code base;
try to write simple code.

CosmosDB with EF Core

Vladimir Nesterovsky — Wed, 20 Oct 2021 16:11:39 GMT

Lately we have been spending a great deal of time working with Azure's CosmosDB.

This is how it's defined:

"It is schema-agnostic, horizontally scalable, and generally classified as a NoSQL database."

This quote, not too confident in itself, is clarified further as:

"The SQL API lets clients create, update and delete containers and items. Items can be queried with a read-only, JSON-friendly SQL dialect."

To be honest this SQL API made us favor CosmosDB.

So, we started a development with CosmosDB as a data storage.

The next development ingredient we learned the hard way is to try to refrain from clever techniques.

The lesson we learned is simple: after you finish with a project, provided it's not a toy, you give it to people who will be supporting it. You should think about those future developers before you insert some cleverness into your code.

With this common sense we selected EF Core as a library that will serve as an interface between C# and the database.

Initially all went well, until we needed to store a list of strings as a document property and found that it's not possible.

Why? - was a naive question.

The answer puzzled us a lot: because a string is not an "Entity" (whatever that means), and EF is a framework of entities.

You could argue with this argument as long as you like, it just does not work. It is especially bad if you need to store a class that you do not directly control e.g. structures returned from other services.

The next pothole with EF was when we tried to run an innocent query that joins data within a document: e.g. a document contains items, and you want to query some items from some documents.

Guess what?

Right, EF Core does not support it.

Why?

Because!

Later we found that many other constructs and functions that you easily use in the SQL dialect of CosmosDB are not possible or not supported in EF Core.

We were very upset with those crutches and came to a conclusion that EF Core harms more than helps when you work with CosmosDB.

We went on and looked at how to work directly with the CosmosDB client, and found that it has all the features ready:

you may give it SQL and bind parameters;
you may convert result items to objects;
you may create, delete, update and query data.

So, do we need EF Core?

We answered, no.

This does not mean we reject the value of EF Core, but here our conclusion was that this API layer just complicated things instead of making them simpler. It might be that EF Core for CosmosDB is not mature enough at this time.

Binary serialization in .NET 5.0

Vladimir Nesterovsky — Tue, 02 Feb 2021 12:39:37 GMT

Recently we have found that BinaryFormatter.Serialize and BinaryFormatter.Deserialize methods are marked as obsolete in .NET 5.0, and are declared dangerous:

The BinaryFormatter type is dangerous and is not recommended for data processing. Applications should stop using BinaryFormatter as soon as possible, even if they believe the data they're processing to be trustworthy. BinaryFormatter is insecure and can't be made secure.

See BinaryFormatter security guide for more details.

That guide, along with its links, expands on the problems BinaryFormatter poses. The scheme of dangerous use cases we have seen so far is like this:

two different sides communicate with each other;
one side supplies input in BinaryFormatter's format;
the other side reads the input using BinaryFormatter and instantiates classes.

A threat arises when the two sides cannot trust each other or cannot establish a trusted communication channel. In these cases malicious input can be supplied to the side reading the data, which might lead to unexpected code execution, denial of service, data exposure and other bad consequences.

Arguing like this, today's maintainers of .NET concluded that it's better to tear down BinaryFormatter and similar APIs out of the framework.

Note that they don't claim that BinaryFormatter itself, or the Reflection API it uses, is the core of the problem. They blame the communication.

Spelling out clearly what the concerns are could help everyone to better understand how to address them. In the area of communication security there are a lot of ready solutions, like:

use signatures to avoid tampering with the data;
use encryption to avoid spying on the data;
use access rights to prevent access to the data altogether;
use secure communication channels.
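As an illustration of the first item, a payload can be accompanied by a keyed signature that the reader checks before it deserializes anything. This is only a sketch: SignedPayload, Sign and TryVerify are hypothetical names, the layout (32-byte HMAC-SHA256 tag followed by the payload) is an assumption, and sharing the key between the sides is out of scope here.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SignedPayload
{
  const int TagLength = 32; // size of an HMAC-SHA256 tag in bytes

  // Prepends an HMAC-SHA256 tag to the payload.
  public static byte[] Sign(byte[] payload, byte[] key)
  {
    using var hmac = new HMACSHA256(key);
    var tag = hmac.ComputeHash(payload);
    var result = new byte[tag.Length + payload.Length];

    Buffer.BlockCopy(tag, 0, result, 0, tag.Length);
    Buffer.BlockCopy(payload, 0, result, tag.Length, payload.Length);

    return result;
  }

  // Returns the payload only when the tag matches; otherwise rejects the data.
  public static bool TryVerify(byte[] signed, byte[] key, out byte[] payload)
  {
    payload = Array.Empty<byte>();

    if (signed.Length < TagLength)
    {
      return false;
    }

    var body = signed.AsSpan(TagLength).ToArray();

    using var hmac = new HMACSHA256(key);
    var expected = hmac.ComputeHash(body);

    // Constant-time comparison of the received and the expected tag.
    if (!CryptographicOperations.FixedTimeEquals(signed.AsSpan(0, TagLength), expected))
    {
      return false;
    }

    payload = body;
    return true;
  }

  public static void Main()
  {
    var key = Encoding.UTF8.GetBytes("demo key, not for production");
    var signed = Sign(Encoding.UTF8.GetBytes("hello"), key);

    Console.WriteLine(TryVerify(signed, key, out _)); // True

    signed[signed.Length - 1] ^= 1; // tamper with the last byte
    Console.WriteLine(TryVerify(signed, key, out _)); // False
  }
}
```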

We can surely state that, without applying these solutions, any other serialization format is just as unreliable and is subject to the same vulnerabilities.

After all, it looks like an attempt to throw the baby out with the bath water. The good news is that, thanks to the now modular structure of the .NET runtime, we're able to access binary serialization libraries, which are (and will be) available in NuGet repositories. So, it's a futile effort to erase this useful API.

Collecting public data from Internet

Arthur Nesterovsky — Fri, 01 Jan 2021 14:34:37 GMT

Earlier we wrote that we've recently gotten a few tasks related to Machine Learning. A prerequisite for such tasks is to collect and prepare the input data. Usually the required data is scattered across public sites; some of it is in plain text format (or close to it), but other data is accessible only as the output of public applications. To obtain the required data from such sites you have to navigate through pages, which often requires keeping state between navigations.

In order to implement this task you need some kind of crawler/scraper of websites. Fortunately, there are a lot of frameworks, libraries and tools in C# (and in other languages too) that allow you to do this (visit this or this site to see the most popular of them), for example:

ScrapySharp
ABot
HtmlAgilityPack
DotnetSpider

There are pros and cons to using these libraries. The most crucial con is a lack of support for rich UIs based on heavy client-side scripts and client-side state. Not all such libraries implement full browser emulation, and some of them do not support JavaScript execution at all. So, they are suited to gathering information from simple web pages, but no library makes it easy to navigate to some page of a web application that keeps rich client-side state. Even the best of them, like ScrapySharp, require heavy programming to achieve the result.

Then, suddenly, we recalled that for several years already we have been using Selenium and web drivers to automate web tests for AngularJS/Angular projects. After a short discussion we came to the conclusion that there is no big difference between testing a web application and collecting data, since one of the testing stages is collecting actual results (data) from the tested page, and usually our tests consist of chains of actions performed on consecutively visited pages.

This way we came to the idea of using the WebDriver API implemented by the Selenium project. There are implementations of this API in different languages, C# included.

Using WebDriver we can easily implement cumbersome navigation of a complex web application and collect the required data. Moreover, WebDriver can run in screenless mode. Some of its features allow you to take snapshots of the virtual screen and to store the HTML sources that result from JavaScript execution. These features are very useful during run-time troubleshooting. To navigate a complex web application we need only a bit more knowledge than a usual user of that application: we need to identify page elements somehow, for example by CSS selectors or by ids of HTML elements (as we do for tests). All the rest, like cookies, view state (if any), values of hidden fields and some JavaScript events, will be transparent in this case.

Although one may say that the approach with Selenium is rather fat, it ought to be mentioned that it is rather scalable. You may either run several threads with a different WebDriver instance in each thread, or run several processes simultaneously.

However, besides pros there are cons in the solution with Selenium. They appear when you decide to publish it, e.g. to the Azure environment. Note that the approach with Selenium requires a browser on the server. There is also a problem with Azure itself, as it's Microsoft's platform and Selenium is a product of their main competitor Google... So, some issues aren't technical. The only possible solution is to use the PaaS approach instead of SaaS, but in this case you have to support everything by yourself...

The other problem is that if your application crawls rather aggressively, either the servers you gather data from or your own host might ban it. So, be gentle, play nice, and implement delays between requests.

Also, take into account that when you implement any crawler some problems may appear on the legal level, since not all web sites allow you to pull anything you want. Many sites have terms & conditions that define rules for the site's users (which your crawler should follow too); otherwise legal action may be taken against the users (or, in the case of a crawler, its owners). There is a very interesting article that describes many pitfalls of implementing your own crawler.

To summarize everything said above, the Selenium project can be used in many scenarios, and one of them is creating a powerful crawler.

Big Angular applications

Vladimir Nesterovsky — Tue, 29 Dec 2020 18:15:25 GMT

While doing Cool:GEN migrations to Java and C# we produce rather big Angular applications.

Everything is fine: the server runs REST APIs, and the client is an Angular application with a component per each window, dialog or screen. The only problem is with the word big.

We observe that enterprises that used Cool:GEN to develop their workflow come to the migration stage with applications containing thousands of windows. In simple cases, after assessment, clients are able to split their monolith workflow into a set of independent applications. But even then we are dealing with Angular applications containing hundreds to many thousands of components.

Now, let's look at the Angular world. Best practices advise (and actually almost force you) to use Ahead Of Time, Ivy compilation of all components and their templates.

A naive attempt to build such a monolith Angular application will most surely fail. Angular uses nodejs for the build, and chances are close to 100% that nodejs runs out of memory during ng build.

You may fight this by throwing a capable build machine at it, with 16 or better 32GB of RAM, and instructing nodejs to use all of it.

Well, it's a rather poor, brute-force way of dealing with scale problems, but it works.

The next hurdle you run into is time. We know it might take days just to build it.

You may ask why?

Well, Angular is doing its best to validate templates!

Unfortunately the only viable workaround is to switch this nice feature off for such a big project.

With such a setup you're able to build the Angular project in just 20-30 minutes!

Well, this is big progress if you compare complete failure with something that at least passes the build.

But what's next?

Sure, there are further problems:

scripts, both development and production, are of nonsensical size: several dozen megabytes for production, and even more for development;
ng serve eats even more memory and builds even longer, making a nightmare out of development and support of such an application;
startup of such an application, if it starts at all, is very slow.

So, what can be done? How can we create a manageable Angular application containing that many components?

Angular advises lazy-loading feature modules.

That's reasonable advice!

We can create coarse grained modules containing subsets of components or fine grained modules containing one component.

So, does it help?

Yes, but it does not solve all problems:

ng build and ng serve are still very slow;
build produces multiple small scripts that are loaded on demand, so at least application works in browser.

Yet another important problem is still there: we have multiple severely separated server REST controllers with components that represent them.

On the server side we have Java or C# REST controllers hosting business logic. They are split across multiple projects, probably managed by separate teams, and probably kept in separate Git repositories (or whatever). On the client side we have a fat Angular project with everything kept in a single source repository.

This is not going to work from a management perspective.

So the next step is to try to split the fat Angular project into multiple small projects. So, let's say we shall have 100 small Angular libraries combined by a master project.

This is not going to work either, due to the nature of npm projects, as it will require terabytes of cloned node_modules folders, one for each library, and many hours to build each of them.

It seems there is room for improvement in the npm area. There is no point in making dedicated copies of node_modules. It's much easier to have a local cache of artifacts.

So, what is the direction? How do we create a big Angular project after all?

What we have here is:

a big enterprise level application;
it is modular but modules must work together to form desired workflow;
different modules are supported by different teams (both server and client side);
client Angular components correspond to REST controllers on the server;
all pages use the same styles and the same set of UI controls.

Looking from this perspective all development can be seen as:

development and support of unified styles and ui components that must be reused through the application;
development of server side and REST controllers that implement business logic;
development of templates of components (note that components themselves do nothing except expose their templates).

Studying this design suggests an important and independent role for templates, just like in AngularJS!

In contrast, in Angular templates are only a tool used by components. It's not obvious how to use thousands of templates without first building thousands of components; neither is it obvious how to dynamically host those templates (routes do not help here).

Though not obvious, it's still possible to do, though it requires use of a somewhat lower-level API than tutorials suggest. The ingredients are:

use of Just In Time (in contrast to Ahead Of Time) compilation, and use of View Engine (in contrast to Ivy);
use of ViewContainerRef to host components dynamically;
dynamic components and modules that you create on demand using templates loaded e.g. through HttpClient.

To keep things short, we shall show an example of dynamic components in the next article.

Here we shall emphasize that such a design allows us to create a small Angular application that builds in under 20 seconds, with component templates served along with the REST controllers and stored in the same Git repository.

So, if you have, say, a server subproject exposing a REST controller in the form api/area/MyComponent, then its template may be exposed as the resource page/area/MyComponent. Templates are loaded and compiled on demand at runtime, thus making the application light. At the same time templates may be cached in the browser cache, thus reducing the number of roundtrips to the server.

Groups in SQL

Vladimir Nesterovsky — Wed, 05 Aug 2020 07:44:07 GMT

Recently our colleague turned to us and asked for help with a complex query.

It turned out that the complex part was to understand what he wanted to achieve.

After listening to him we formulated the task in our own words and confirmed that this is what he wanted.

So, here is the task in our formulation:

Assume you have events.
Each event acts upon one or more accounts.
Find all events that act on the same set of accounts.
Note that we deal with multiple millions of events and accounts.

Data is defined like this:

create table dbo.Event
(
  EventID bigint not null,
  AccountID varchar(18) not null,
  primary key(EventID, AccountID)
);

The requested query turned out to be very simple, yet not as simple as one would think, since it has to account for the big amount of data:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

The idea is to:

calculate a hash derived from list of accounts for each group;
join groups with the same hash;
verify that matched groups fit perfectly.
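
The three steps above can be sketched outside SQL as well. The following Java fragment is only an illustration under our own assumptions: the class and method names are hypothetical, events are passed as an in-memory map from event ID to its set of accounts, and a XOR of string hash codes stands in for checksum_agg(checksum(AccountID)) as an order-independent hash.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the original query.
public class SameAccountSets {
    /** Returns pairs [id1, id2], id1 < id2, of events acting on equal account sets. */
    public static List<List<Long>> pairs(Map<Long, Set<String>> events) {
        // Step 1: bucket events by (item count, order-independent hash).
        Map<List<Integer>, List<Long>> buckets = new HashMap<>();
        for (Map.Entry<Long, Set<String>> e : new TreeMap<>(events).entrySet()) {
            int hash = 0;
            for (String account : e.getValue()) {
                hash ^= account.hashCode(); // XOR does not depend on order
            }
            List<Integer> key = List.of(e.getValue().size(), hash);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }

        // Steps 2 and 3: compare only events within a bucket, and verify
        // that the candidate sets are really equal (hashes may collide).
        List<List<Long>> result = new ArrayList<>();
        for (List<Long> ids : buckets.values()) {
            for (int i = 0; i < ids.size(); i++) {
                for (int j = i + 1; j < ids.size(); j++) {
                    Set<String> a = events.get(ids.get(i));
                    Set<String> b = events.get(ids.get(j));
                    if (a.equals(b)) {
                        result.add(List.of(ids.get(i), ids.get(j)));
                    }
                }
            }
        }

        // Sort pairs to make the output independent of bucket order.
        result.sort(Comparator
            .comparing((List<Long> p) -> p.get(0))
            .thenComparing(p -> p.get(1)));
        return result;
    }
}
```

The point of the hash is that the expensive set comparison runs only inside small buckets instead of over all pairs of events.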

An even simpler solution that does not use hashes is not scalable, as its running time grows faster than O(N^2), where N is the number of events. It takes unacceptable time with N ~1e4, to say nothing of N ~1e7.

At this point our colleague was already satisfied, as he got the result in a couple of minutes for a task that he could not even formalize as SQL.

But we felt it could be even better.

We looked at statistics:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    count(*) Items
  from
    D
  group by
    EventID
)
select
  Items, count(*) EventCount
from
  S
group by
  Items
order by
  EventCount desc;

and saw that most of the events, about 90%, deal with a single account, and all the others with two or more (some of them act upon a big number of accounts).

The nature of the dataset gave us a hint for a more verbose but faster query:

with D as
(
  select * from dbo.Event
),
S as
(
  select
    EventID,
    min(AccountID) AccountID,
    count(*) Items,
    checksum_agg(checksum(AccountID)) Hash
  from
    D
  group by
    EventID
)
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items = 1 and
    S2.Items = 1 and
    S1.AccountID = S2.AccountID
union all
select
  S1.EventID, S2.EventID
from
  S S1
  inner join
  S S2
  on
    S1.EventID < S2.EventID and
    S1.Items > 1 and
    S2.Items > 1 and
    S1.Items = S2.Items and
    S1.Hash = S2.Hash and
    not exists
    (
      select AccountID from D where EventID = S1.EventID
      except
      select AccountID from D where EventID = S2.EventID
    );

This query produced results in twenty seconds instead of a couple of minutes for a dataset with ~1e7 rows.

Algorithm for Biconnected components

Vladimir Nesterovsky — Sun, 24 May 2020 12:15:02 GMT

While working on an algorithm to trace biconnected components for the Graph API in XSLT we realized that we had implemented it unconventionally.

The pseudocode in Wikipedia is:

GetArticulationPoints(i, d)
    visited[i] := true
    depth[i] := d
    low[i] := d
    childCount := 0
    isArticulation := false

    for each ni in adj[i] do
        if not visited[ni] then
            parent[ni] := i
            GetArticulationPoints(ni, d + 1)
            childCount := childCount + 1
            if low[ni] ≥ depth[i] then
                isArticulation := true
            low[i] := Min (low[i], low[ni])
        else if ni ≠ parent[i] then
            low[i] := Min (low[i], depth[ni])
    if (parent[i] ≠ null and isArticulation) or (parent[i] = null and childCount > 1) then
        Output i as articulation point

That algorithm is based on the fact that a connected graph can be represented as a tree of biconnected components. The vertices that join those components are called articulation points. The implementation deals with the depth of each vertex, and with a lowpoint parameter that is also related to vertex depth during Depth-First-Search.

Out of interest we approached the problem from a different perspective. A vertex is an articulation point if it has neighbors that cannot be joined by a path that does not contain this vertex. Like the classical algorithm, we use Depth-First-Search to navigate the graph, but in contrast we collect cycles that pass through each vertex. If during the back pass of Depth-First-Search we find no cycle from a "child" to an "ancestor", then the vertex is necessarily an articulation point.

Here is pseudocode:

GetArticulationPoints(v, p) -> result
    index = index + 1
    visited[v] = index 
    result = index
    articulation = p = null ? -1 : 0

    for each n in neighbors of v except p do
        if visited[n] = 0 then
            nresult = GetArticulationPoints(n, v)
            result = min(result, nresult)
            if nresult >= visited[v] then
                articulation = articulation + 1
        else
            result = min(result, visited[n])

    if articulation > 0 then
        Output v as articulation point
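
To make the second pseudocode easier to try, here is a minimal Java sketch of it. The class name, the edge-list input and the use of -1 for "no parent" are our assumptions for the illustration; the graph is assumed to be simple (no parallel edges), and recursion depth limits it to moderately sized graphs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical class name; vertices are numbered 0 .. vertexCount - 1.
public class ArticulationPoints {
    private final List<List<Integer>> adjacency = new ArrayList<>();
    private final int[] visited;              // 0 = not visited, else DFS index
    private final TreeSet<Integer> points = new TreeSet<>();
    private int index = 0;

    private ArticulationPoints(int vertexCount, int[][] edges) {
        visited = new int[vertexCount];
        for (int i = 0; i < vertexCount; i++) {
            adjacency.add(new ArrayList<>());
        }
        for (int[] edge : edges) {            // undirected edges
            adjacency.get(edge[0]).add(edge[1]);
            adjacency.get(edge[1]).add(edge[0]);
        }
    }

    /** Returns articulation points in ascending order. */
    public static List<Integer> find(int vertexCount, int[][] edges) {
        ArticulationPoints state = new ArticulationPoints(vertexCount, edges);
        for (int v = 0; v < vertexCount; v++) {
            if (state.visited[v] == 0) {
                state.visit(v, -1);           // -1 stands for "p = null"
            }
        }
        return new ArrayList<>(state.points);
    }

    // Returns the smallest DFS index reachable from the subtree of v.
    private int visit(int v, int p) {
        visited[v] = ++index;
        int result = visited[v];
        // A root needs two separated subtrees, hence it starts from -1.
        int articulation = p < 0 ? -1 : 0;

        for (int n : adjacency.get(v)) {
            if (n == p) {
                continue;
            }
            if (visited[n] == 0) {
                int nresult = visit(n, v);
                result = Math.min(result, nresult);
                if (nresult >= visited[v]) {
                    articulation++;           // no cycle from n's subtree above v
                }
            } else {
                result = Math.min(result, visited[n]);
            }
        }

        if (articulation > 0) {
            points.add(v);
        }
        return result;
    }
}
```

For example, for a triangle 0-1-2 with a tail 2-3, the call ArticulationPoints.find(4, new int[][] {{0, 1}, {1, 2}, {2, 0}, {2, 3}}) reports vertex 2 only.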

The complexity of both algorithms is the same.

What is interesting is that we see no obvious way to transform one algorithm into the other, other than by starting from graph theory.

More is on Wiki.

On XSLT 4

Vladimir Nesterovsky — Tue, 19 May 2020 07:00:25 GMT

Michael Kay's "A Proposal for XSLT 4.0" has sparked our interest in what could be added or changed in XSLT. So we decided to implement a Graph API purely in xslt. Our goals were:

to prove that it's possible to provide an efficient implementation of different graph algorithms in XSLT;
to build the Graph API in a way that lets an engine provide native implementations of graph algorithms;
to find through experiments what could be added to XSLT as a language.

At present we may confirm that the first two goals are reachable; and experiments have shown that XSLT could do more to help make programs better, e.g. we have seen that the language could simplify the coding of cycles.

Graph algorithms are often expressed with while cycles, e.g. "Dijkstra's algorithm" has:

while Q is not empty:
    u ← vertex in Q with min dist[u]

The body is executed while the condition is satisfied, but the condition is affected by the body itself.

In xslt 3.0 we did this with simple recursion:

<xsl:template name="f:while" as="item()*">
  <xsl:param name="condition" as="function(item()*) as xs:boolean"/>
  <xsl:param name="action" as="function(item()*) as item()*"/>
  <xsl:param name="next" as="function(item()*, item()*) as item()*"/>
  <xsl:param name="state" as="item()*"/>

  <xsl:if test="$condition($state)">
    <xsl:variable name="items" as="item()*" select="$action($state)"/>

    <xsl:sequence select="$items"/>

    <xsl:call-template name="f:while">
      <xsl:with-param name="condition" select="$condition"/>
      <xsl:with-param name="action" select="$action"/>
      <xsl:with-param name="next" select="$next"/>
      <xsl:with-param name="state" select="$next($state, $items)"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>
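
For readers less used to XSLT, the same contract can be sketched in Java. This is only our illustration of the shape of f:while: the class and method names are hypothetical, and a plain loop replaces the recursion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical Java analogue of the f:while template.
public class WhileSketch {
    /**
     * While condition(state) holds: run action(state), emit its items,
     * and compute the next state from the current state and those items.
     */
    public static <S, T> List<T> whileLoop(
            Predicate<S> condition,
            Function<S, List<T>> action,
            BiFunction<S, List<T>, S> next,
            S state) {
        List<T> output = new ArrayList<>();

        while (condition.test(state)) {
            List<T> items = action.apply(state);

            output.addAll(items);               // like xsl:sequence
            state = next.apply(state, items);   // like the recursive call
        }

        return output;
    }
}
```

A countdown, WhileSketch.<Integer, Integer>whileLoop(s -> s > 0, s -> List.of(s), (s, items) -> s - 1, 3), produces the items 3, 2, 1.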

But here is the point: it could be done in a more comprehensible way, e.g. by letting xsl:iterate without select cycle until xsl:break is reached.

<xsl:iterate>
  <xsl:param name="name" as="..." select="..."/>
  
  <xsl:if test="...">
    <xsl:break/>
  </xsl:if>

  ...
</xsl:iterate>

So, what we propose is to make xsl:iterate/@select optional, and to change the behavior of the processor when the attribute is missing from a compilation error to a valid behavior. This should not affect any existing valid XSLT 3.0 program.

Graphs in XSLT

Vladimir Nesterovsky — Tue, 12 May 2020 06:08:51 GMT

Recently we read the article "A Proposal for XSLT 4.0", and thought it worth suggesting one more idea. We wrote a message to Michael Kay, the author of this proposal. Here it is:

A&V
Historically xslt, xquery and xpath dealt with trees. Nowadays it has become much more common to process graphs. Many tasks can be formulated in terms of graphs, and in particular any task processing trees is also a graph task.

I suggest taking a deeper look in this direction.

As an inspiration I suggest looking at "P1709R2: Graph Library" - the C++ proposal.

Michael Kay
I have for many years found it frustrating that XML is confined to hierarchic relationships (things like IDREF and XLink are clumsy workarounds); also the fact that the arbitrary division of data into "documents" plays such a decisive role: documents should only exist in the serialized representation of the model, not in the model itself.

I started my career working with the Codasyl-defined network data model. It's a fine and very flexible data model; its downfall was the (DOM-like) procedural navigation language. So I've often wondered what one could do trying to re-invent the Codasyl model in a more modern idiom, coupling it with an XPath-like declarative access language extended to handle networks (graphs) rather than hierarchies.

I've no idea how close a reinvention of Codasyl would be to some of the modern graph data models; it would be interesting to see. The other interesting aspect of this is whether you can make it work for schema-less data.

But I don't think that would be an incremental evolution of XSLT; I think it would be something completely new.

A&V
I was not so radical in my thoughts.

Even the C++ API is not so radical, as they do not impose hard requirements on the internal graph representation but rather define a template API that will work both with third party representations (they even mention Fortran) and with several built-in implementations that use standard vectors.

Their strong point is in the algorithms provided as part of the library and not in the graph's internal structure (I think the authors of that paper have not structured it in the best way). E.g. in the second part they list graph algorithms: Depth First Search (DFS); Breadth First Search (BFS); Topological Sort (TopoSort); Shortest Paths Algorithms; Dijkstra Algorithms; and so on.

If we try to map it to the xpath world then a graph on the API level might be represented as a user function or as a map of user functions.

On the storage level a user may implement a graph using a sequence of maps or a map of maps, or even using xdm elements.

So, my approach is evolutionary. In fact I suggest a pure API that could even be implemented now.

Michael Kay
Yes, there's certainly scope for graph-oriented functions such as closure($origin, $function) and is-reachable($origin, $function) and find-path($origin, $destination, $function) where we use the existing data model, treating any item as a node in a graph, and representing the arcs using functions. There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.

A&V
> There are a few complications, e.g. what's the identity comparison between arbitrary items, but it can probably be done.

One approach to address this is through the definition of the graph API, e.g. to define a graph as a map (interface analogy) of functions, with equality functions if required:
map
{
  vertices: function(),
  edges: function(),
  value: function(vertex),
  in-vertex: function(edge),
  out-vertex: function(edge),
  edges: function(vertex),
  is-in-vertex: function(edge, vertex),
  is-out-vertex: function(edge, vertex)
  ...
}
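
The idea of treating any item as a vertex and representing arcs with functions can be tried out in any language with first-class functions. Below is a small Java sketch of closure and is-reachable from the discussion above. The signatures are our guess at what such functions could look like, the closure is assumed to include the origin itself, and identity of items is delegated to equals and hashCode, which is exactly the complication mentioned in the conversation.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical Java counterpart of closure($origin, $function)
// and is-reachable($origin, $function).
public class GraphFunctions {
    /** All items reachable from origin (origin included), in BFS order. */
    public static <T> Set<T> closure(T origin, Function<T, Collection<T>> arcs) {
        Set<T> seen = new LinkedHashSet<>();
        Deque<T> queue = new ArrayDeque<>();

        seen.add(origin);
        queue.add(origin);

        while (!queue.isEmpty()) {
            T item = queue.remove();

            for (T next : arcs.apply(item)) {
                // Set.add returns false for already seen items,
                // so cycles do not cause endless traversal.
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }

        return seen;
    }

    public static <T> boolean isReachable(
            T origin, T destination, Function<T, Collection<T>> arcs) {
        return closure(origin, arcs).contains(destination);
    }
}
```

Nothing here depends on how the graph is stored: the arcs function may read a map of maps, walk elements, or compute neighbors on the fly.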

Not sure how far this will go but who knows.

COBOL - raise from the dead or RIP?

Arthur Nesterovsky — Wed, 15 Apr 2020 06:00:51 GMT

It's no secret that the COVID-19 epidemic will change our world significantly. It affects the economy and public services, especially the social services of our countries. We saw this in our country, and now the same is happening in the US. Probably the same thing is happening in every country that suffers from COVID-19.

Nowadays such services are usually exposed online, but nobody expected such an extreme load on them. And they start to buckle under such load. Programs start to crash... and somebody has to fix them. It's just a temporary technical inconvenience when there is staff familiar with such programs and technologies, but what about the situation when the programs and technologies are obsolete? When the staff that can support them are about to retire due to age, when the knowledge is almost lost... It's very scary when such applications rule very important spheres of our life such as social services, finance, medicine, defence etc. Something similar happened in the US, so the US government asked IBM for help with their software written in COBOL.

Probably, in the short term, they'll close the gaps, but taking into account that the epidemic won't dissolve in a month, there is a risk of staying in the same hole... There are two ways to solve this issue in the long term:

to make COBOL a widely used programming language and to teach enough programmers to use it. This is exactly what IBM tries to do, see this article, but this is a way to nowhere, since COBOL is not too popular among software developers.
to migrate such applications to new technologies and new platforms (e.g. Java or C# on UNIX/Windows). In this case organizations obtain scalable applications and the ability to find human resources that can fix and modernize such applications step by step, in spare time, without losing existing functionality. This is what our company Advanced may provide. And we are not alone. There are many such companies that can implement such a migration at a high level of quality.

And many professionals (even those who deal with COBOL on a day-to-day basis) think that only the second way is viable. Let's see what will happen... For more about the issue, see here.

Angular vs AngularJS. Practical comparison.

Vladimir Nesterovsky — Sat, 04 Apr 2020 12:37:15 GMT

People compare these two technologies, and it seems an established fact that Angular is the evolutionarily more advanced framework. We're not going to contradict this; on the contrary, we agree with it, but it's better for an opinion to be grounded in facts that one can evaluate and verify.

Fortunately we got a chance to make such a comparison.

We support conversions of Cool:GEN (a legacy CASE tool with roots in the 80s) to java or C#. In its time Cool:GEN made it possible to greatly automate enterprise development using Mainframes as the server side and Mainframe terminals or Win32 GUIs as clients.

The legacy of this tool is probably hundreds of business and database models, and millions of programs generated in COBOL on Mainframes and in C or Java on Windows and Linux. All this runs to this day in many economic sectors.

Usually the client is some enterprise that has invested a lot into the design, development and support of its business model using Cool:GEN, but now most such clients are trying not to lose this legacy but to convert it into something that keeps pace with today's technologies.

As the original technology is sound, it is possible to map it to today's Java or C# on the server, REST or SOAP as a transport, and Angular, AngularJS or some other framework on the client. Such automatic conversion is the essence of our conversion efforts.

To understand the scope, consider a typical enterprise client that has 2-3 thousand windows that are backed by 20-30 thousand programs.

Now, consider that the conversion is done. As output, among other things, we produce a clean java or C# web application with REST and SOAP interfaces, and an Angular or AngularJS web client that encapsulates those 2-3 thousand windows.

Each window definition is rather small, 5-10 KB in html form, but the whole mass of windows takes 10-30 MB, which is not small any more.

For AngularJS we generate just those html templates, but for Angular we need to generate a separate component for each window that includes a typescript class, template and style.

While the amount of generated resources for AngularJS stays at those 10-30 MB, generated Angular takes at least 5-10 MB more.

The next step is build.

AngularJS builds a distribution that includes all used libraries and a set of templates, and it takes something like a minute of CPU time. The produced output is about 300 KB of minified script and those 10-30 MB of templates (multiple files of 5-10 KB each).

Angular (here we talk about version 9) builds a distribution that includes all used libraries and a set of compiled components that are to be loaded lazily on demand. Out of the box, the Angular builder, which performs tree shaking, takes days to build. With tree shaking off it takes 40 minutes. This is the first notable difference. The produced output for ES2015 (latest javascript) is about 1 MB, plus 15-100 KB per compiled component. This is the second notable difference, which already impacts the end user rather than the developer.

The third difference is in the end user experience. Though we have built equivalent Angular and AngularJS frontends, we observe that the load time of Angular is higher. This cannot be ascribed only to bigger file sizes. It seems internal initialization also takes more time for Angular.

So, our experience in this particular test shows that Angular has more room to improve, in particular: compile time, bundle size, runtime speed and simplicity of dynamic loading (we have strong cases when template compilation is not the best approach).

Theory vs practice

Vladimir Nesterovsky — Fri, 26 Jul 2019 16:38:11 GMT

We were asked to help with a search service in one enterprise. We were told that their SharePoint portal does not serve their needs. The main complaints were about the quality of search results.

They had decided to implement an external index of SharePoint content, using Elastic, and to expose a custom search API within the enterprise.

We questioned their conclusions, asked why they thought Elastic would give much better results, and asked whether they had tried to figure out why SharePoint does not give the desired results.

The answers did not convince us, though we joined the project.

What do you think? Elastic did not help at all, though they hoped very much that its query language would help to rank results in such a way that the matching documents would be found. After all, they thought it was a problem of ranking of results.

Here we started our analysis. We took a specific document that must be found but is never returned from search.

It turned out to be a well known problem; at least we had dealt with a closely related one in the past. There are two ingredients here:

documents that have low chances to be found are PDFs;
we live in Israel, so most texts are in Hebrew, which means words are written from right to left, while some other texts run from left to right. See Bi-directional text.

Traditionally PDF documents are produced in a way that only distantly resembles the logical structure of the original content. E.g., paragraphs of text are often represented as unrelated runs of text lines, or as sets of text runs representing single words, or independent characters. Needless to say, an additional complication comes from the fact that Hebrew text is often stored visually (from left to right, as if "hello" were stored as "olleh" and just printed from right to left). Another common feature of PDFs is custom fonts with non-canonical mappings, or images with glyphs of letters.
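
The "olleh" effect is easy to reproduce. The following Java fragment is a deliberately simplified model: it assumes that visual storage is a plain reversal of characters, while real documents follow the much more involved Unicode bidirectional rules, and the class and method names are ours.

```java
// Simplified model of visually stored text; names are hypothetical.
public class VisualOrder {
    /** Stores logical text the way it is laid out: characters reversed. */
    public static String toVisual(String logical) {
        return new StringBuilder(logical).reverse().toString();
    }

    public static void main(String[] args) {
        String stored = toVisual("hello world");

        // What a text extractor hands to the search index.
        System.out.println(stored);                   // dlrow olleh

        // A user searching for the word she sees on the page finds nothing.
        System.out.println(stored.contains("hello")); // false

        // Only after restoring logical order does the search succeed.
        System.out.println(toVisual(stored).contains("hello")); // true
    }
}
```

No ranking formula can help here: the indexed text simply does not contain the words the user types.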

You can implement these tricks in other document formats, but for some reason PDF is the only format we have seen that regularly and intensively uses these techniques.

At this point we realized that it's not a fault of the search engine that it cannot find the document, but a feature of the PDF, which exposes its text to a crawler in a form that cannot be used for search. In fact, a PDF viewer cannot search in such documents either: when you try to find some text that you clearly see in the document opened in a pdf viewer, you often find nothing.

A question: what should you do when no PDF text extractor can give you the correct text, but the text is there when you look at the document in a pdf viewer?

We decided it was time to go in the direction of image recognition. Thankfully, nowadays it's a matter of available processing resources.

Our goals were:

Have images of each PDF page. This task is immediately solved with Apache PDFBox (A Java PDF Library) - it's time to say this is a java project.
Run Optical Character Recognition (OCR) over the images, and get the extracted texts. This is done perfectly by tesseract-ocr/tesseract, and thanks to its java wrapper bytedeco/javacpp-presets we can call this C++ API directly from java.

The only small nuisance of tesseract is that it does not expose table recognition info, but we can easily overcome this (we solved this task in the past), as along with each text run tesseract exposes its position.
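
To give an idea of how positions help, here is a rough Java sketch that groups text runs into table rows by their vertical position. It is not the code of our project: the Run class, the tolerance parameter and the grouping rule are assumptions made for the illustration, and no tesseract API is involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical post-processing of OCR output; not a tesseract API.
public class RowGrouping {
    /** A recognized text run with the top-left corner of its box. */
    public static final class Run {
        final String text;
        final int x;
        final int y;

        public Run(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups runs into rows: a run starts a new row when it lies more
     * than tolerance pixels below the first run of the current row.
     * Cells within a row are ordered from left to right.
     */
    public static List<List<String>> rows(List<Run> runs, int tolerance) {
        List<Run> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingInt((Run r) -> r.y));

        List<List<Run>> rows = new ArrayList<>();
        int rowY = 0;

        for (Run run : sorted) {
            if (rows.isEmpty() || run.y - rowY > tolerance) {
                rows.add(new ArrayList<>());
                rowY = run.y;
            }
            rows.get(rows.size() - 1).add(run);
        }

        List<List<String>> result = new ArrayList<>();

        for (List<Run> row : rows) {
            row.sort(Comparator.comparingInt((Run r) -> r.x));

            List<String> cells = new ArrayList<>();
            for (Run run : row) {
                cells.add(run.text);
            }
            result.add(cells);
        }

        return result;
    }
}
```

For right to left pages the order of cells within a row would have to be reversed; the sketch ignores this.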

What are the results of running such a program?

Full success! It works with high quality of recognition. Indeed, there is no physical noise to impact the quality.
Slow speed - up to several seconds per recognition per page.
Scalable solution. Slow speed can be compensated by almost unlimited theoretical scalability.

So, what is the lesson we have taken from this experience?

Well, you should question yourself, and test and verify ideas on the ground before building any theories that will lead you in a completely wrong direction. In the end people started to realize there was no need to blame SharePoint, to throw it away, and to spend a great deal of time and money just to prove that the problem lies elsewhere.

A sample source code can be found at App.java

Scheduling algorithm for xsl:for-each/@saxon:threads=N

Vladimir Nesterovsky — Sun, 24 Mar 2019 07:52:02 GMT

This story started half a year ago when Michael Kay, the author of the Saxon XSLT processor, was dealing with performance in a multithreaded environment. See Bug #3958.

The problem is like this.

Given XSLT:

<xsl:stylesheet exclude-result-prefixes="#all" 
  version="3.0" 
  xmlns:saxon="http://saxon.sf.net/"
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" />

  <xsl:template name="main">
    <xsl:for-each saxon:threads="4" select="1 to 10">
      <xsl:choose>
        <xsl:when test=". eq 1">
          <!-- Will take 10 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/10')?url"/>
        </xsl:when>
        <xsl:when test=". eq 5">
          <!-- Will take 9 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/9')?url"/>
        </xsl:when>
        <xsl:when test=". eq 10">
          <!-- Will take 8 seconds -->
          <xsl:sequence select="
            json-doc('https://httpbin.org/delay/8')?url"/>
        </xsl:when>
      </xsl:choose>
    </xsl:for-each>
    <xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>

Implement an engine to achieve the best performance of a parallel for-each.

A naive implementation that distributes iterations among threads will run into unfair load on threads, so some load-balancing is required. That was the case in Saxon EE.

Michael Kay was trying to find the most elegant way to implement this and wrote the comment:

I can't help feeling that the answer to this must lie in using the Streams machinery, and Spliterators in particular. I've spent another hour or so reading all about Spliterators, and I have to confess I really don't understand the paradigm. If someone can enlighten me, please go ahead...

We decided to take the challenge and to model the expected behavior using Streams. Here is our go:

import java.util.stream.IntStream;
import java.util.stream.Stream;
import java.util.function.Consumer;
import java.util.function.Function;

public class Streams
{
  public static class Item<T>
  {
    public Item(int index, T data)
    {
      this.index = index;
      this.data = data;
    }
    
    int index;
    T data;
  }

  public static void main(String[] args)
  {
    run(
      "Sequential",
      input(),
      Streams::action,
      Streams::output,
      true);
    
    run(
      "Parallel ordered", 
      input().parallel(),
      Streams::action,
      Streams::output,
      true);
    
    run(
      "Parallel unordered", 
      input().parallel(),
      Streams::action,
      Streams::output,
      false);    
  }
  
  private static void run(
    String description,
    Stream<Item<String>> input,
    Function<Item<String>, String[]> action,
    Consumer<String[]> output,
    boolean ordered)
  {
    System.out.println(description);
    
    long start = System.currentTimeMillis();
   
    if (ordered)
    {
      input.map(action).forEachOrdered(output);
    }
    else
    {
      input.map(action).forEach(output);
    }
    
    long end = System.currentTimeMillis();
    
    System.out.println("Execution time: " + (end - start) + "ms.");
    System.out.println();
  }
  
  private static Stream<Item<String>> input()
  {
    return IntStream.range(0, 10).
      mapToObj(i -> new Item<String>(i + 1, "Data " + (i + 1)));
  }
  
  private static String[] action(Item<String> item)
  {
    switch(item.index)
    {
      case 1:
      {
        sleep(10);
        
        break;
      }
      case 5:
      {
        sleep(9);
        
        break;
      }
      case 10:
      {
        sleep(8);
        
        break;
      }
      default:
      {
        sleep(1);
        
        break;
      }
    }
    
    String[] result = { "data:", item.data, "index:", item.index + "" };
    
    return result;
  }
  
  private synchronized static void output(String[] value)
  {
    boolean first = true;
    
    for(String item: value)
    {
      if (first)
      {
        first = false;
      }
      else
      {
        System.out.print(' ');
      }
    
      System.out.print(item);
    }

    System.out.println();
  }
  
  private static void sleep(int seconds)
  {
    try
    {
      Thread.sleep(seconds * 1000);
    }
    catch(InterruptedException e)
    {
      throw new IllegalStateException(e);
    }
  }
}

We model three cases:

"Sequential"

slowest, single threaded execution with output:

data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 34009ms.

"Parallel ordered"

fast, multithread execution preserving order, with output:

data: Data 1 index: 1
data: Data 2 index: 2
data: Data 3 index: 3
data: Data 4 index: 4
data: Data 5 index: 5
data: Data 6 index: 6
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 9 index: 9
data: Data 10 index: 10
Execution time: 10019ms.

"Parallel unordered"

fastest, multithread execution not preserving order, with output:

data: Data 6 index: 6
data: Data 2 index: 2
data: Data 4 index: 4
data: Data 3 index: 3
data: Data 9 index: 9
data: Data 7 index: 7
data: Data 8 index: 8
data: Data 5 index: 5
data: Data 10 index: 10
data: Data 1 index: 1
Execution time: 10001ms.

What we can add in conclusion is that an xslt engine could try to decide automatically what approach to use, as many SQL engines do, rather than force the developer to go into low level engine details.
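
For comparison with saxon:threads="4", where the number of threads is explicit, the same load balancing can be sketched with a fixed thread pool instead of parallel streams. This is our illustration only and says nothing about how Saxon implements the attribute; the class and method names are hypothetical. Free threads pick the next pending item from the pool's queue, so one slow item does not hold back the items that were "assigned" to the same thread, and results are collected in input order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical model of an ordered parallel for-each with N threads.
public class PooledForEach {
    public static <T, R> List<R> forEach(
            List<T> items, int threads, Function<T, R> action) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try {
            // Submit all items; idle threads take the next one from the queue.
            List<Future<R>> futures = new ArrayList<>();

            for (T item : items) {
                Callable<R> task = () -> action.apply(item);

                futures.add(pool.submit(task));
            }

            // Reading futures in submission order preserves input order.
            List<R> results = new ArrayList<>();

            for (Future<R> future : futures) {
                results.add(future.get());
            }

            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling PooledForEach.forEach(items, 4, action) with the action from the Streams class above would model the "Parallel ordered" case with exactly four worker threads.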

XPath through evolution

Vladimir Nesterovsky — Sat, 03 Nov 2018 20:59:28 GMT

Recently we observed how we solved the same task in different versions of XPath: 2.0, 3.0, and 3.1.

Consider that you have a sequence $items, and you want to call some function over each item of the sequence and to return the combined result.

In XPath 2.0 this was solved like this:

for $item in $items return
  f:func($item)

In XPath 3.0 this was solved like this:

$items!f:func(.)

And now with XPath 3.1 that defined an arrow operator => we attempted to write something as simple as:

$items=>f:func()

That definitely does not work, as it is the same as f:func($items).

Next attempt was:

$items!=>f:func()

That does not even compile.

So, finally, working expression using => looks like this:

$items!(.=>f:func())

This looks like a step back compared to the XPath 3.0 variant.

More than that, the XPath grammar of the arrow operator forbids the use of predicates, axes or mapping operators, so this won't compile:

$items!(.=>f:func()[1])

$items!(.=>f:func()!something)

Our conclusion is that the arrow operator is a rather confusing addition to XPath.

Xslt Streamability

Vladimir Nesterovsky — Tue, 02 Oct 2018 12:50:22 GMT

Xslt 3.0 defines a feature called streamability: a technique to write xslt code that is able to handle arbitrarily sized inputs.

This contrasts with conventional xslt code (and xslt engines) where inputs are completely loaded in memory.

To make code streamable a developer should declare her code as such, and the code should pass Streamability analysis.

The goal is to define a subset of xslt/xpath operations that allows processing input in one pass.

In a simple case it's indeed a simple task to verify that code is streamable, but the more complex your code is, the less trivial it is to ascertain that it is streamable.

On the forums we have seen a lot of discussions where experts were trying to figure out whether a particular xslt is streamable. At times it's a remarkably nontrivial task!

This, in our opinion, clearly shows that the feature is a largely failed attempt to inscribe an optimization technique into the xslt spec.

The place of such optimization is in the implementation space, and not in the spec. An engine should attempt such optimization and fall back to the traditional implementation.

The latest such example is: Getting SXST0060 "No streamable path found in expression" when trying to push a map with grounded nodes to a template of a streamable mode, where neither the xslt code developers nor the engine developers are sure that the code is streamable in the first place.

By the way, besides streamability there is another optimization technique that works in probably all SQL engines. When data does not fit into memory the engine may spill it to disk, thus trading memory pressure for disk access. So, why didn't such a technique find its way into the Xslt or SQL specs?

C++ Exceptions

Vladimir Nesterovsky — Mon, 06 Aug 2018 12:00:10 GMT

A discussion about exceptions has continued in the C++ community for more than 25 years. In our opinion it can only be compared with the math community and its open problems, like Hilbert's 23 problems dating from 1900.

In essence the C++ exception discussion is about the efficiency of exceptions vs status codes. This problem is not so acute in other languages (like java or C#) because those languages postulate different goals.

C++ designers have introduced a zero-overhead principle for any language feature, which is:

If you don’t use some feature you don’t pay for it.
If you do use it you cannot (reasonably) write it more efficiently by hand.

Exceptions, compared to status codes, do not withstand this demand. This has led to the fragmentation of the C++ community, where many big projects and code styles ban exceptions partially or completely.

Make no mistake: all this time people were trying to make exceptions better, and have found techniques to make them space and run time efficient to some extent, but still, plain old status codes outperform them both in speed (especially in the predictability of the timing of exception handling logic) and in code size.

We guess the solution has finally been found after a quarter of a century of discussion!

WG paper: Zero-overhead deterministic exceptions: Throwing values by Herb Sutter. This "paper aims to extend C++’s exception model to let functions declare that they throw a statically specified type by value. This lets the exception handling implementation be exactly as efficient and deterministic as a local return by value, with zero dynamic or non-local overheads."

In other words the author suggests to:

extend the exception model (in fact implement an additional one);
make exceptions as fast and as predictable as status codes (which virtually means designating a status code as a new exception).

Here are the author's arguments:

A status code is usually just a number, and handling an error amounts to performing some if or switch statements.
Handling errors with status codes is predictable in time and visible in the code, though it burdens the logic very much.
Exceptions give a clear separation between control flow and error handling.
Exceptions are mostly invisible in the code, so it's not always clear how much they add to code size and how they impact performance.
Statistics show that exceptions add 15 to 30% to the size of code, even if no exceptions are used (violation of the zero-overhead principle);
Exceptions require Run Time Type Information about types, and have no predictable memory (stack or heap) footprint (violation of the zero-overhead principle).

What the author suggests might seem too radical, but at present it's the only viable solution to reestablish the zero-overhead principle and to reunite the two C++ camps.

Here is the approach.

Clarify what should be considered as an exception.
1. Contract violation.
  Are contract violations, like invalid argument values or broken postconditions (invariants that do not hold), exceptions or programmer's bugs?
  
  If the latter, then it's best to terminate, as you cannot correctly recover from a bug.
2. Virtual Machine fault.
  What can a user program do about a stack overflow?
  
  The best, according to the author, is to terminate.
3. OOM - Out Of Memory error.
  What is the best way to deal with OOM during dynamic allocation?
  
  At present there are two operators:
  - new - to allocate memory dynamically and throw bad_alloc on failure.
  - new(nothrow) - to allocate memory dynamically and return nullptr on failure.
  Herb Sutter suggests changing the behavior of new to terminate on failure (it is very hard to handle bad_alloc properly anyway), while new(nothrow) will still allow building code that can deal with OOM.
4. Partial success
  This should never be reported as an error, and status codes should be used to report the state.
5. Error condition distinct from any type of success.
  This is where exceptions should be used.
Statistics show that with such a separation more than 90% of what is currently an exception will not be an exception any more, so no hidden exception logic is required: the program either works or terminates.
Refactor exception
Redefine what exception object is and how it is propagated.

It should be thin value type. At minimum it needs to contain an error code. Suggested size is up to a couple of pointers.

Compiler should be able to cheaply allocate and copy it on the stack or even in the processor's registers.

Such simple exception type resolves problems with lifetime of exception object, and makes exception handling as light as checking status codes.

Exception should be propagated through the return channel, so it's like a new calling convention in which a function yields either a result or an error outcome.
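To make the "return channel" idea more tangible, here is a minimal sketch. It is written in Java only to keep our illustrations in one language, and it is not the mechanism proposed in the paper: Outcome, ReturnChannelDemo and the error codes are names we invented for the illustration. It only shows a small value that carries either a result or an error code, and a caller that checks and forwards it through ordinary returns.

```java
/** Hypothetical thin "result or error code" value; not part of any standard API. */
final class Outcome<T> {
    private final T value;        // meaningful only when errorCode == 0
    private final int errorCode;  // 0 means success

    private Outcome(T value, int errorCode) {
        this.value = value;
        this.errorCode = errorCode;
    }

    static <T> Outcome<T> ok(T value) {
        return new Outcome<>(value, 0);
    }

    static <T> Outcome<T> error(int errorCode) {
        if (errorCode == 0) {
            throw new IllegalArgumentException("0 is reserved for success");
        }
        return new Outcome<>(null, errorCode);
    }

    boolean isOk() { return errorCode == 0; }

    int errorCode() { return errorCode; }

    T value() {
        if (!isOk()) {
            throw new IllegalStateException("no value, error code " + errorCode);
        }
        return value;
    }
}

final class ReturnChannelDemo {
    // Error codes invented for this sketch.
    static final int ERR_NOT_A_NUMBER = 1;
    static final int ERR_DIVIDE_BY_ZERO = 2;

    static Outcome<Integer> parse(String text) {
        try {
            return Outcome.ok(Integer.parseInt(text.trim()));
        } catch (NumberFormatException e) {
            return Outcome.error(ERR_NOT_A_NUMBER);
        }
    }

    /** Each failure is forwarded to the caller through the ordinary return value. */
    static Outcome<Integer> divide(String a, String b) {
        Outcome<Integer> x = parse(a);
        if (!x.isOk()) return Outcome.error(x.errorCode());
        Outcome<Integer> y = parse(b);
        if (!y.isOk()) return Outcome.error(y.errorCode());
        if (y.value() == 0) return Outcome.error(ERR_DIVIDE_BY_ZERO);
        return Outcome.ok(x.value() / y.value());
    }

    public static void main(String[] args) {
        Outcome<Integer> result = divide("10", "0");
        System.out.println(result.isOk()
            ? "result: " + result.value()
            : "error code: " + result.errorCode());
    }
}
```

The explicit checks in divide() are what, as we read the proposal, the compiler would generate for you, so the source could keep the try/catch look.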

It's not our intention to quote the whole paper here. If you're interested then please read it. Here we want to outline our concerns.

Exception payload.
This paper emphasizes that exception type should be small.

So, what to do with exception payload, if any (at least error message if it's not a static text)?

If this won't be resolved then developers will start to invent custom mechanisms like GetLastErrorMessage().

And what to do with aggregate exceptions?

We think this must be clearly addressed.
Implementation shift.
We can accept that most of the current exceptions will terminate.

But consider now some container that serves requests, like web container or database.

It may be built from multiple components and serve multiple requests concurrently. If one request terminates, we don't want the whole container to terminate.

If the terminate handler is called then we cannot rely on the state of the application. At the very least we can expect heap leaks and unreleased resources.

So, we either want to be able to release heap and other resources per request, or we want to go down with the whole process and let the OS deal with it.

In the latter case we need to design such containers differently: as a set of cooperating processes; the OS should make it easier to spin up processes.
VM with exceptions
There are Virtual Machines that allow exception to be thrown on each instruction (like JVM, or CLI).

You cannot tell in this case that code would never throw exception, as it can out of the blue!

Even on x86 you can get a PAGE FAULT on memory access, which can be translated into an exception.

So, it's still a question whether the terminate() solution is sound in every case, and whether the compiler can optimize out exception handling if it proves statically that no exception should be thrown.

SQL Server vs DB2

Vladimir Nesterovsky — Sat, 14 Apr 2018 19:38:20 GMT

We often deal with different SQL DBs, and in particular DB2, Oracle, and SQL Server, and this is what we have found lately.

Our client has reported a problem with SQL insert into the DB2:

subject table has a small number of columns, but large number of rows;
insert should attempt to insert a row but tolerate the duplicate.

The prototype table is like this:

create table Link(FromID int, ToID int, primary key(FromID, ToID));

DB2 SQL insert is like this:

insert into Link(FromID, ToID)
values(1, 2)
except
select FromID, ToID from Link;

The idea is to have empty row set to insert if there is a duplicate.

SQL Server variant looks like this:

insert into Link(FromID, ToID)
select 1, 2
except
select FromID, ToID from Link;

Client reported ridiculously slow performance of this SQL, due to table scan to calculate results of except operator.

Out of interest we performed the same experiment with SQL Server, and found that the execution plan is optimal, and an index seek is used to check duplicates.

The only reasonable way of dealing with such DB2's peculiarity, except trying to insert and handle duplicate exception, was to qualify select with where clause:

insert into Link(FromID, ToID)
values(1, 2)
except
select FromID, ToID from Link where FromID = 1 and ToID = 2;

We think DB2 could do better.

Some fun with java switch

Vladimir Nesterovsky — Tue, 22 Aug 2017 15:18:29 GMT

Today we wanted to write some code in java that performs one or the other action depending on a condition. At the same time, if the first action fails, we wanted to fall back to the other action.

We've written it like this:

switch(boolean_expression)
{
  case true:
  {
    try
    {
      // Some actions.
      break;
    }
    catch(Exception e)
    {
      // Fall back to false route. 
    }
  }
  case false:
  {
    // Other actions.
    break;
  }
}

The fun part is that it's not valid java code.

Why?

The answer can be found in spec: 14.11. The switch Statement

The type of the Expression must be char, byte, short, int, Character, Byte, Short, Integer, String, or an enum type (§8.9), or a compile-time error occurs.

But why?

Who knows...

Sure, there are workarounds, even with switch, but the restriction just does not seem justified...
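One such workaround, as a sketch only: map the boolean to an int and keep the same fall-through shape. The class SwitchFallbackDemo and its method are names made up for this illustration, and the two Supplier arguments stand in for "some actions" and "other actions".

```java
import java.util.function.Supplier;

final class SwitchFallbackDemo {
    /**
     * Runs primary when condition is true, and falls back to fallback
     * when condition is false or when primary throws.
     */
    static String run(boolean condition, Supplier<String> primary, Supplier<String> fallback) {
        // switch does not accept boolean, so switch over an int instead.
        switch (condition ? 1 : 0) {
            case 1:
                try {
                    return primary.get();
                } catch (RuntimeException e) {
                    // Deliberately fall through to the "false" route.
                }
            default:
                return fallback.get();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true, () -> "primary", () -> "fallback"));
        System.out.println(run(true, () -> { throw new IllegalStateException("failed"); }, () -> "fallback"));
    }
}
```

A plain if with a try inside, followed by the fallback code, does the same job; the switch form merely stays close to what we originally wrote.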

Saxon HE map and array types.

Vladimir Nesterovsky — Tue, 16 May 2017 06:20:48 GMT

We have found that Saxon HE 9.7.0-18 has finally exposed partial support for map and array item types. So, now you can keep structured items in a sequence rather than having a single flat sequence and treating odd and even elements specially.

Basic example is:

<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="t"
  xmlns:map="http://www.w3.org/2005/xpath-functions/map"
  exclude-result-prefixes="xs t map">

  <xsl:template match="/">
    <xsl:variable name="map" as="map(xs:string, xs:string)" select="
      map 
      {
        'Su': 'Sunday',
        'Mo': 'Monday',
        'Tu': 'Tuesday',
        'We': 'Wednesday',
        'Th': 'Thursday',
        'Fr': 'Friday',
        'Sa': 'Saturday'
      }"/>
      
     <xsl:message select="map:keys($map)"/>
  </xsl:template>  

</xsl:stylesheet>

A list of map functions can be found here http://www.w3.org/2005/xpath-functions/map/, though not all are available, as Saxon HE still does not allow inline functions.

P.S. From the development perspective it's a great harm that Saxon HE is so limited. Basically limited to xslt 2.0 + some selected parts of 3.0.

View on tunnel parameters in XSLT

Vladimir Nesterovsky — Sun, 26 Mar 2017 04:21:36 GMT

Lately we do not program in XSLT too often but rather in java, C#, SQL and javascript, but from time to time we have tasks in XSLT.

People claim that those languages are too different and use this argument to explain why XSLT is only a niche language. We, on the other hand, often spot similarities between them.

So, what in other languages corresponds to tunnel parameters in XSLT?

To get an answer we reiterated how they work in XSLT, so, you:

define a template with parameters marked as tunnel="yes";
use these parameters the same way as regular parameters;
pass template parameters down to other templates marking them as tunnel="yes";

The important difference of regular template parameters from tunnel parameters is that the tunnel parameters are implicitly passed down the call chain of templates. This means that you:

define your API that is expected to receive some parameter;
pass these parameters somewhere high in the stack, or override them later in the stack chain;
do not bother to propagate them (you might not even know all of the tunnel parameters passed, so encapsulation is in action);

As a result we have a template with some parameters passed explicitly, while others receive their values from somewhere else, usually not from the direct caller. It’s possible to say that these tunnel parameters are injected into a template call. This closely resembles injection APIs in other languages, where you configure that some parameters are prepared for you by some container rather than by the direct caller.

Now, when we have expressed this idea it seems so obvious but before we thought of this we did not realize that tunnel parameters in XSLT and Dependency Injection in other languages are the same thing.

Dependency Injection

Vladimir Nesterovsky — Thu, 06 Oct 2016 11:27:42 GMT

Our genuine love is C++. Unfortunately, clients don't always share our preferences, so we are mostly occupied with C#, java and javascript. Nevertheless, we're closely watching the evolution of C++. It has become more mature in the latest specs.

Recently, we wondered how we would deal with dependency injection in C++. What we found only strengthened our commitment to C++.

Parameter packs introduced in C++ 11 allow trivial implementation of constructor injection, while std::type_index, std::type_info and std::any give service containers.

In fact there are many DI implementations out there. The one we refer to here is Boost.DI. It's not standard, nor can we claim it's the best, but it's a good example of how this concept can be implemented.

So, consider their example seen in Java with CDI, in C# in .NET Core injection, and in C++:

Java:

@Dependent
public class Renderer 
{
  @Inject @Device
  private int device;
};

@Dependent
public class View 
{
  @Inject @Title
  private String title;
  @Inject
  private Renderer renderer;
};

@Dependent
public class Model {};

@Dependent
public class Controller 
{
  @Inject
  private Model model;
  @Inject
  private View view;
};

@Dependent
public class User {};

@Dependent
public class App 
{
  @Inject
  private Controller controller;
  @Inject
  private User user;
};

...
  Provider<App> provider = ...

  App app = provider.get();

C#:

public class RendererOptions
{
  public int Device { get; set; }
}
    
public class ViewOptions
{
  public string Title { get; set; }
}
    
public class Renderer 
{
  public Renderer(IOptions<RendererOptions> options)
  {
    Device = options.Value.Device;
  }

  public int Device { get; set; }
}

public class View 
{
  public View(IOptions<ViewOptions> options, Renderer renderer)
  {
    Title = options.Value.Title;
    Renderer = renderer;
  }

  public string Title { get; set; }
  public Renderer Renderer { get; set; }
}

public class Model {}

public class Controller 
{
  public Controller(Model model, View view) 
  {
    Model = model;
    View = view;
  }

  public Model Model { get; set; }
  public View View { get; set; }
};

public class User {};

public class App 
{
  public App(Controller controller, User user) 
  {
    Controller = controller;
    User = user;
  }

  public Controller Controller { get; set; }
  public User User { get; set; }
};

...
  IServiceProvider serviceProvider = ...

  serviceProvider.GetService<App>();

C++:

#include <string>

#include <boost/di.hpp>

namespace di = boost::di;

struct renderer 
{
  int device;
};

class view 
{
public:
  view(std::string title, const renderer&) {}
};

class model {};

class controller 
{
public:
  controller(model&, view&) {}
};

class user {};

class app 
{
public:
  app(controller&, user&) {}
};

int main()
{
  /**
   * renderer renderer_;
   * view view_{"", renderer_};
   * model model_;
   * controller controller_{model_, view_};
   * user user_;
   * app app_{controller_, user_};
   */

  auto injector = di::make_injector();
  injector.create<app>();
}

What is different between these DI flavors?

Not too much from the perspective of the final task achieved.

In java we used member injection, with qualifiers to inject scalars.

In C# we used constructor injection with Options pattern to inject scalars.

In C++ we used constructor injection with direct constants injected.

All technologies have their API to initialize DI container, but, again, while API is different, the idea is the same.

So, the expressiveness of C++ matches that of java and C#.

Deeper analysis shows that java's CDI is more feature rich than the DI of C# and C++, but, personally, we consider it an advantage of C# and C++ that they have such a light DI.

At the same time there is an important difference between C++ vs java and C#.

While both java and C# are deemed to use reflection (C# in theory could use code generation on the fly to avoid reflection), C++'s DI natively constructs and injects services.

What does it mean for the user?

Well, a lot! Both in java and in C# you would not want to use DI in a performance critical part of code (e.g. in a tight loop), while it's Ok in C++ due to near to zero performance impact from DI. This may result in more modular and performant code in C++.
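To show where the reflection cost comes from, here is a deliberately tiny sketch of constructor injection in java. It is neither CDI nor any real container: TinyInjectorDemo and its create() method are invented for the illustration, and we assume every service has exactly one constructor whose parameters are themselves constructible services. Every call inspects constructors and instantiates objects reflectively, which is the work a C++ injector can resolve at compile time.

```java
import java.lang.reflect.Constructor;

final class TinyInjectorDemo {
    static class Model {}
    static class View {}
    static class User {}

    static class Controller {
        final Model model;
        final View view;
        Controller(Model model, View view) { this.model = model; this.view = view; }
    }

    static class App {
        final Controller controller;
        final User user;
        App(Controller controller, User user) { this.controller = controller; this.user = user; }
    }

    /**
     * Creates an instance of type, recursively creating constructor arguments.
     * Assumption of this sketch: each type declares exactly one constructor.
     */
    static <T> T create(Class<T> type) {
        Constructor<?>[] constructors = type.getDeclaredConstructors();
        if (constructors.length != 1) {
            throw new IllegalArgumentException(type.getName() + " must have exactly one constructor");
        }
        Constructor<?> constructor = constructors[0];
        Class<?>[] parameterTypes = constructor.getParameterTypes();
        Object[] arguments = new Object[parameterTypes.length];
        for (int i = 0; i < parameterTypes.length; i++) {
            arguments[i] = create(parameterTypes[i]);  // resolve dependencies first
        }
        try {
            constructor.setAccessible(true);
            return type.cast(constructor.newInstance(arguments));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot construct " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        App app = create(App.class);
        System.out.println(app.controller.model != null);
    }
}
```

Real containers cache much of this metadata, so take the sketch as an illustration of the mechanism rather than a measurement.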

ASP.NET Core Session

Vladimir Nesterovsky — Wed, 28 Sep 2016 19:22:15 GMT

While reading on ASP.NET Core Session, and analyzing the difference with previous version of ASP.NET we bumped into a problem...

At Managing Application State they note:

Session is non-locking, so if two requests both attempt to modify the contents of session, the last one will win. Further, Session is implemented as a coherent session, which means that all of the contents are stored together. This means that if two requests are modifying different parts of the session (different keys), they may still impact each other.

This is different from previous versions of ASP.NET where session was blocking, which meant that if you had multiple concurrent requests to the session, then all requests were synchronized. So, you could keep consistent state.

In ASP.NET Core you have no built-in means to keep a consistent state of the session. Even the assurance that the session is coherent does not help in any way.

Your options are:

build your own synchronization to deal with this problem (e.g. around the database);
decree that your application cannot handle concurrent requests to the same session, so client should not attempt it, otherwise behaviour is undefined.

Dao for programmers.

Arthur Nesterovsky — Mon, 29 Feb 2016 18:33:04 GMT

8 Ways to Become a Better Coder is a good article. Read it and apply it to yourself. Never mind what your occupation is. Replace "coder" with your profession. It suits everybody who wants to be the best.

Pull visitor pattern

Vladimir Nesterovsky — Tue, 09 Feb 2016 12:37:10 GMT

Visitor pattern is often used to separate operation from object graph it operates with. Here we assume that the reader is familiar with the subject.

The idea is like this:

The operation over object graph is implemented as type called Visitor.
Visitor defines methods for each type of object in the graph, which are called during traversal of the graph.
Traversing over the graph is implemented by a type called Traverser, or by the Visitor or by each object type in the graph.

Implementation should collect, aggregate or perform other actions during visit of objects in the graph, so that at the end of the visit the purpose of operation will be complete.

Such implementation is push-like: you create operation object and call a method that gets object graph on input and returns operation result on output.

In the past we often dealt with big graphs (usually these are virtual graphs backended at database or at a file system).

Also having a strong experience in the XSLT we see that the visitor pattern in OOP is directly mapped into xsl:template and xsl:apply-templates technique.

Another thought was that in XML processing there are two camps:

SAX (push-like) - those who process xml in callbacks, which is very similar to visitor pattern; and
XML Reader (pull-like) - those who pull xml components from a source, and then iterate and process them.

As with SAX vs XML Reader or, more generally, push vs pull processing models, neither is the best. One or the other is preferable in particular circumstances. E.g. a pull-like component fits into a transformation pipeline where one pull component has another as its source; another example is when one needs to process two sources at once, which is nontrivial with a push-like model. On the other hand, push processing fits better into the Reduce part of the MapReduce pattern, where you need to accumulate results from a source.

So, our idea was to complete classic push-like visitor pattern with an example of pull-like implementation.

For the demonstration we have selected the Java language, and the simplest boolean expression calculator.

Please follow GitHub nesterovsky-bros/VisitorPattern to see the detailed explanation.
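For a quick taste of the difference, here is a compact sketch. It is not the code from the repository: PullVisitorDemo and everything inside it are names invented for this post. The push variant lets a traverser call back into a visitor, while the pull variant exposes the same post-order walk as an Iterator that the consumer drives at its own pace.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

final class PullVisitorDemo {
    enum Op { CONST, NOT, AND, OR }

    static final class Expr {
        final Op op;
        final boolean value;  // used by CONST only
        final Expr[] args;
        private Expr(Op op, boolean value, Expr... args) { this.op = op; this.value = value; this.args = args; }
        static Expr constant(boolean value) { return new Expr(Op.CONST, value); }
        static Expr not(Expr arg) { return new Expr(Op.NOT, false, arg); }
        static Expr and(Expr left, Expr right) { return new Expr(Op.AND, false, left, right); }
        static Expr or(Expr left, Expr right) { return new Expr(Op.OR, false, left, right); }
    }

    /** Push style: the traverser owns the loop and calls the visitor. */
    interface Visitor { void visit(Expr node); }

    static void traverse(Expr node, Visitor visitor) {
        for (Expr arg : node.args) traverse(arg, visitor);
        visitor.visit(node);
    }

    /** Pull style: the same post-order walk, but the consumer asks for the next node. */
    static final class PostOrder implements Iterator<Expr> {
        private final Deque<Expr> nodes = new ArrayDeque<>();
        private final Deque<Boolean> expanded = new ArrayDeque<>();

        PostOrder(Expr root) { nodes.push(root); expanded.push(false); }

        public boolean hasNext() { return !nodes.isEmpty(); }

        public Expr next() {
            if (nodes.isEmpty()) throw new NoSuchElementException();
            while (true) {
                Expr node = nodes.pop();
                boolean done = expanded.pop();
                if (done || node.args.length == 0) return node;
                // Revisit this node after its arguments have been produced.
                nodes.push(node); expanded.push(true);
                for (int i = node.args.length - 1; i >= 0; i--) { nodes.push(node.args[i]); expanded.push(false); }
            }
        }
    }

    /** One evaluation step over a value stack, shared by both styles. */
    static void apply(Expr node, Deque<Boolean> stack) {
        switch (node.op) {
            case CONST: stack.push(node.value); break;
            case NOT: stack.push(!stack.pop()); break;
            case AND: { boolean r = stack.pop(), l = stack.pop(); stack.push(l && r); break; }
            case OR: { boolean r = stack.pop(), l = stack.pop(); stack.push(l || r); break; }
        }
    }

    static boolean evalPush(Expr root) {
        Deque<Boolean> stack = new ArrayDeque<>();
        traverse(root, node -> apply(node, stack));
        return stack.pop();
    }

    static boolean evalPull(Expr root) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (Iterator<Expr> it = new PostOrder(root); it.hasNext();) apply(it.next(), stack);
        return stack.pop();
    }

    /** (true and not false) or false */
    static Expr sample() {
        return Expr.or(Expr.and(Expr.constant(true), Expr.not(Expr.constant(false))), Expr.constant(false));
    }
}
```

With the pull form the consumer can stop early, or interleave two PostOrder iterators over two graphs, which is the case that is awkward to express with callbacks.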

Error during transformation in Saxon 9.7 - Continue

Vladimir Nesterovsky — Mon, 04 Jan 2016 10:07:12 GMT

Essence of the problem (see Error during transformation in Saxon 9.7, thread on forum):

XPath engine may arbitrarily reorder predicates whose expressions do not depend on a context position.
While an XPath expression $N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")] cannot raise an error if it's evaluated from the left to right, an expression with reordered predicates $N[xs:date(@x) gt xs:date("2000-01-01")][@x castable as xs:date] may generate an error when @x is not a xs:date.

To avoid a potential problem one should rewrite the expression like this: $N[if (@x castable as xs:date) then xs:date(@x) gt xs:date("2000-01-01") else false()].

Please note that the following rewrite will not work: $N[(@x castable as xs:date) and (xs:date(@x) gt xs:date("2000-01-01"))], as the arguments of an and expression can be evaluated in any order, and an error that occurs during evaluation of any argument may be propagated.

With these facts we faced a task to check our code base and to fix possible problems.

A search has found ~450 instances of XPath expressions that use two or more consecutive predicates. Careful analysis limited this to ~20 instances that should be rewritten. But then, all of a sudden, we decided to run an experiment. What if we split an XPath expression into two subexpressions? Can the error still resurface?

Consider:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>
  <xsl:template match="/">
    <xsl:variable name="a" as="element()*" select="$elements[self::d or self::e]"/>
    <xsl:variable name="b" as="element()*" select="$a[xs:integer(@value) = 1]"/>

    <xsl:sequence select="$b"/>
  </xsl:template>

</xsl:stylesheet>

As we expected Saxon 9.7 internally assembles a final XPath with two predicates and reorders them. As result we get an error:

Error at char 20 in xsl:variable/@select on line 8 column 81 of Saxon9.7-filter_speculation.xslt: FORG0001: Cannot convert string "c" to an integer

This turn of events greatly complicates the code review we have to carry out.

Michael Kay's answer to this example:

I think your argument that the reordering is inappropriate when the expression is written using variables is very powerful. I shall raise the question with my WG colleagues.

In fact we think that either reordering of predicates is inappropriate, or (a weaker option that still allows reordering) an error during evaluation of a predicate expression should be treated as false(). This is what is done in XSLT patterns. Other solutions make XPath less intuitive.

In other words we should use XPath (language) to express ideas, and engine should correctly and efficiently implement them. So, we should not be forced to rewrite expression to please implementation.

Error during transformation in Saxon 9.7

Vladimir Nesterovsky — Sat, 02 Jan 2016 21:32:16 GMT

On December, 30 we have opened a thread in Saxon help forum that shows a stylesheet generating an error. This is the stylesheet:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:variable name="elements" as="element()+"><a/><b value="c"/></xsl:variable>
  <xsl:template match="/">
    <xsl:sequence select="$elements[self::d or self::e][xs:integer(@value) = 1]"/>
  </xsl:template>

</xsl:stylesheet>

We get an error:

Error at char 47 in xsl:sequence/@select on line 7 column 83 of Saxon9.7-filter_speculation.xslt: FORG0001: Cannot convert string "c" to an integer Exception in thread "main" ; SystemID: .../Saxon9.7-filter_speculation.xslt; Line#: 7; Column#: 47 ValidationException: Cannot convert string "c" to an integer at ...

It's interesting that error happens in Saxon 9.7 but not in earlier versions.

The answer we got was expected but disheartening:

The XPath specification (section 2.3.4, Errors and Optimization) explicitly allows the predicates of a filter expression to be reordered by an optimizer. See this example, which is very similar to yours:
The expression in the following example cannot raise a casting error if it is evaluated exactly as written (i.e., left to right). Since neither predicate depends on the context position, an implementation might choose to reorder the predicates to achieve better performance (for example, by taking advantage of an index). This reordering could cause the expression to raise an error.
$N[@x castable as xs:date][xs:date(@x) gt xs:date("2000-01-01")]

Following the spec, Michael Kay advises us to rewrite XPath:

$elements[self::d or self::e][xs:integer(@value) = 1]

like this:

$elements[if (self::d or self::e) then xs:integer(@value) = 1 else false()]

Such subtleties make it hard to reason about and to teach XPath. We doubt many people will spot the difference immediately.

We think that if such optimization was so much important to spec writers, then they had to change filter rules to treat failed predicates as false(). This would avoid any obscure differences in these two, otherwise equal, expressions. In fact something similar already exists with templates where failed evaluation of pattern is treated as un-match.

Akinator like engine

Vladimir Nesterovsky — Wed, 16 Dec 2015 12:33:41 GMT

A colleague has approached us with a question on how the Akinator engine may work.

To our shame we have never heard about this amazing game before. To fill the gap we have immediately started to play it, and have identified it as a Troubleshooting solver.

It took us a couple of minutes to come up with a brilliant solution: "We just need to google and find the engine in the internet".

Unfortunately, this led nowhere, as Akinator itself is not open source, and no other good quality open source solutions are available.

After another hour we have got two more ideas:

The task should fit into SQL;
The task is a good candidate for a neural network.

In fact, the first might be required to teach the second, so we have decided to formalize the problem in terms of SQL, while still keeping in mind a neural network.

With this goal we have created a GitHub project. Please see the algorithm and its implementation at github.com/nesterovsky-bros/KB.

Database async trap

Vladimir Nesterovsky — Mon, 27 Jul 2015 06:44:45 GMT

Though ADO.NET and other ORM frameworks like EntityFramework and Dapper support the async pattern, you should remember that database drivers (at least all we know about) do not support concurrent db commands running against a single connection.

To see what we mean consider a bug we have recently identified. Consider a code:

await Task.WhenAll(
  newImages.
    Select(
      async image =>
      {
        // Load data from url.
        image.Content = await HttpUtils.ReadData(image.Url);

        // Insert image into the database.
        image.ImageID = await context.InsertImage(image);
      }));

The code runs multiple tasks to read images, and to write them into a database.

The framework decides to run all these tasks in parallel. HttpUtils.ReadData() has no problem with parallel execution, while context.InsertImage() does not run well in parallel, and is subject to race conditions.

To work around the problem we had to use an async variant of a critical section. So the fixed code looks like this:

using(var semaphore = new SemaphoreSlim(1))
{
  await Task.WhenAll(
    newImages.
      Select(
        async image =>
        {
          // Load data from url.
          image.Content = await HttpUtils.ReadData(image.Url);

          await semaphore.WaitAsync();

          try
          {
            // Insert image into the database.
            image.ImageID = await context.InsertImage(image);
          }
          finally
          {
            semaphore.Release();
          }
        }));
}

So, in the async world we still should care about race conditions.
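The same pattern can be sketched in java. This is an analogue, not a translation of our C# fix: SerializedInsertDemo is an invented name, the "insert" is a stand-in counter rather than a database call, and java.util.concurrent.Semaphore blocks a pool thread instead of awaiting. It only shows that tasks started in parallel never overlap inside the guarded section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class SerializedInsertDemo {
    /** Runs taskCount parallel tasks and returns the highest number of overlapping "inserts" seen. */
    static int run(int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Semaphore gate = new Semaphore(1);  // plays the role of SemaphoreSlim(1)
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();

        try {
            List<CompletableFuture<Void>> tasks = new ArrayList<>();

            for (int i = 0; i < taskCount; i++) {
                tasks.add(CompletableFuture.runAsync(() -> {
                    // The "load data" step would go here; it may run in parallel.

                    gate.acquireUninterruptibly();
                    try {
                        // Stand-in for the database insert over a single connection.
                        int now = active.incrementAndGet();
                        maxActive.accumulateAndGet(now, Math::max);
                        Thread.yield();
                        active.decrementAndGet();
                    } finally {
                        gate.release();
                    }
                }, pool));
            }

            // Wait for all tasks, like Task.WhenAll.
            CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
            return maxActive.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("max concurrent inserts: " + run(16));
    }
}
```

Without the semaphore several pool threads can be inside the section at once, which is the situation a single database connection cannot handle.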

Multisite transclusion in Angularjs

Vladimir Nesterovsky — Mon, 04 May 2015 13:07:53 GMT

Having strong experience in ASP.NET and JSF, we found angular's transclusion concept obscure and counterintuitive. It took a while for both of us to grasp the transclude ideas described in the Developer Guide. We suspect that this is due to bad design: a bad design leads to bad wording.

The other consequence of the bad design is that the transclusion is limited to one template per directive, which limits the use of the feature.

Consider:

A directive my-page that encapsulates a page with menu and content.
my-page uses templateUrl: my-page.html to render the page.
my-page.html defines two sites where menu and page content have to be embedded.
Two content fragments are passed to my-page to fill content sites.

Unfortunately, you cannot immediately implement this design in angularjs. On the other hand ASP.NET's Master Pages, and JSF's ui:composition readily solve this task.

Here is one of JSF's approaches:

Define page template my-page.xhtml:
<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:ui="http://java.sun.com/jsf/facelets"
  xmlns:h="http://java.sun.com/jsf/html">
  <h:body>
    <table>
      <tr>
        <td><ui:insert name="menu"/></td>
      </tr>
      <tr>
        <td><ui:insert name="content"/></td>
      </tr>
    </table>
  </h:body>
</html>
Use ui:composition tag to pass parts to the template:
<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:ui="http://java.sun.com/jsf/facelets"
  xmlns:h="http://java.sun.com/jsf/html">
  <h:body>
    <ui:composition template="my-page.xhtml">
      <ui:define name="content">
        My Content
      </ui:define>
      <ui:define name="menu">
        <a href="#file">File</a>
        <a href="#edit">Edit</a>
        <a href="#view">View</a>
      </ui:define>
    </ui:composition>
  </h:body>
</html>

We have decided to model angular directives after JSF, and have defined three simple directives: ui-template, ui-insert, ui-define (see angularjs-api/template/ui-lib.js).

To define a template one writes the following markup (see angularjs-api/template/my-page.html):

<table ui-template>
  <tr>
    <td ui-insert="menu"></td>
  </tr>
  <tr>
    <td ui-insert="content"></td>
  </tr>
</table>

and declares a directive (see angularjs-api/template/my-page.js):

var myPage =
{
  templateUrl: "my-page.html",
  transclude: true
};

angular.module("app").
  directive("myPage", function() { return myPage; });

and finally, to instantiate the directive one needs to write (see angularjs-api/template/sample.html):

<my-page>
  <div ui-define="content">
    My content
  </div>
  <div ui-define="menu">
    <a href="#file">File</a>
    <a href="#edit">Edit</a>
    <a href="#view">View</a>
  </div>
</my-page>

The working sample can be seen through rawgit: sample.html

The other sample that integrates with routing can be found at sample-routing.html

Farewell Entity Framework

Vladimir Nesterovsky — Sun, 30 Nov 2014 12:47:46 GMT

Farewell Entity Framework and hello Dapper!

For many years we were using Entity Framework. It's still very popular and Microsoft's primary Object-Relational Mapper library.

Clearly, the decision is subjective but here are our arguments.

We know and love SQL, and think that in its domain it occupies strong positions. What SQL leaves out of scope is a bridge between itself and other languages. That's where ORM should help.

We strongly believe that no ORM library should try to hide SQL behind the object language itself. We believe in a separation of roles in development. Database design and Data Access Layer should be separated from client's logic. Thus, we strive, if possible, to encapsulate data access through SQL functions and stored procedures.

Entity Framework, in contrast, tries to factor out SQL, giving a perspective of object graph to a client. Initially, it looks promising but at the end a developer should remember that any object query is mapped back to SQL. Without keeping this in mind either query won't compile, or performance will be poor.

E.g. this query will probably fail to build SQL, as no Regex can be mapped to SQL:

var result = context.Content.
  Where(data => Regex.IsMatch(data.Content, pattern)).
  ToArray();

This query might be slow, if no suitable SQL index is defined:

var result = context.Content.
  Where(data => data.Field == value).
  ToArray();

Thus none of EF's goals is achieved completely, SQL power is limited, and Data Access Layer is often fused into other client's logic.

We think that Entity Framework is over-engineered library, which tries to be more than ORM. Its generality often bumps into limits of SQL support in EF: SQL dialects, types, operators, functions, and so on. One can observe that people for years appeal to introduce support of xml, hierarchyid, geometry/geography types, full text search, and so on. This state cannot be different, as EF will never be able and does not aim to support all SQL features.

EF has both design-time and runtime. Each database vendor should implement their EF adapter for EF to play well with that database. This cooperation is not always smooth. E.g see Database first create entity framework 6.1.1 model using system.data.sqlite 1.0.93.

At some point the cost of dealing with EF became too high for us, so we started to look into alternatives: from plain ADO.NET to a lighter ORM library.

To our delight we have immediately found: Dapper - a simple object mapper for .NET. It provides simple extensions to the IDbConnection interface to deal with mapping of query parameters to object properties, and of query results to plain types. Here are some examples:

// Get Customer
var customer = connection.
  Query<Customer>("select * from Customers where CustomerId = @id", new { id = customerID }).
  Single();

// Insert a value
connection.Execute("insert into MyTable(A, B) values(@a, @b)", new { a = 2, b = 3 });

So, Dapper leaves you with plain SQL, which we consider as advantage.

Besides being minimalistic compared to EF, Dapper claims performance close to pure hand written ADO.NET. Indeed, they build dynamic methods to populate parameters and to create row instances, so reflection is used during the warm up period only.

A Proposal to Add 2D Graphics Rendering and Display to C++

Vladimir Nesterovsky — Sun, 17 Aug 2014 08:56:08 GMT

Among latest C++ proposals the most ambiguous is N4021.

The goal of that proposal is "to define a 2D drawing API for the C++ programming language".

The motivation is going like this:

Today, computer graphics are pervasive in modern life, and are even replacing console-style I/O for basic user interaction on many platforms. For example, a simple cout << "Hello, world!" statement doesn’t do anything useful on many tablets and smartphones. We feel that C++ programmers should have a simple, standard way of displaying 2D graphics to users.

Authors compare several public and proprietary APIs to select the one named cairo graphics library as a base.

Reflecting on starting point they write:

Taken as a whole, starting from cairo allows for the creation of a 2D C++ drawing library that is already known to be portable, implementable, and useful without the need to spend years drafting, implementing, and testing a library to make sure that it meets those criteria.
...
An alternative design would be to create a new API via a synthesis of existing 2D APIs. This has the benefit of being able to avoid any perceived design flaws that existing APIs suffer from. Unfortunately this would not have implementation and usage experience. Further, doing so would not provide any guarantee that design flaws would not creep in.

What follows is a discussion of the best way to transform that C library into a std-style C++ API.

Our thoughts on this proposal are threefold:

This proposal seems a decade or two late.
The C++ standard should be modular to support basic and optional features.
We feel that programmers will not be satisfied with bare 2D graphics. It's not enough nowadays.

Indeed, appeals to create a standard C++ API for UI are as old as C++'s standardization process. It's clear why the committee has not produced such an API yet: it is a bureaucracy that can only approve an API. In fact it's the role of the community to invent and implement libraries that may make their way into the standard. Without consensus in the community no standard will reflect such an API.

On the other hand the C++ spec at present is too fat. Probably not many people are satisfied with the pace of its evolution. Any big chunk of new API makes the progress even slower. The C++ spec should go through a refactoring and be split into core(s) and libraries to allow individual progress of each part. This will simplify both specification and implementation. After that refactoring an API can be added or deprecated much more easily. In fact implementations were always like this. It's the spec that tries to be a monolith.

As for a new 2D graphics API, it looks like an idea from the late 90s. We think that today's programmers (at least the several we sampled) wish to deal with an industry standard UI API, and not to start from basic drawing. Looking around we observe that html 5 is such a de facto standard. Take into account that it supports rich layout, svg, canvas, and user input; in addition it's good for GPU optimization. Even if you want to deal with simple graphics you can build svg markup or draw on the canvas.

So, what we would rather see in the C++ spec is an html binding API (both for DOM and Javascript).

Just think of standard C++ program that uses html engine as its UI!

.NET Cache API

Vladimir Nesterovsky — Mon, 28 Jul 2014 05:36:06 GMT

Looking at Guava Cache we think its API is more convenient than .NET's Cache API.

Just consider:

.NET has getters and setters of objects by string keys.
You should provide a caching policy with each setter.
Guava cache operates with typed storage of Key to Value.
It provides a value factory and a caching policy in advance, at cache construction.

Guava's advantage is based on the idea that homogeneous storage assumes a uniform way of creating values and a uniform caching policy. Thus a great part of the logic is factored out into cache initialization.

We have decided to create a simple adapter of the MemoryCache to achieve the same goal. Here is the result of that experiment:

public class Cache<K, V>
  where V: class
{
  /// <summary>
  /// A cache builder.
  /// </summary>
  public struct Builder
  {
    /// <summary>
    /// A memory cache. If not specified then MemoryCache.Default is used.
    /// </summary>
    public MemoryCache MemoryCache;

    /// <summary>
    /// An expiration value.
    /// Alternatively CachePolicyFunc can be used.
    /// </summary>
    public TimeSpan Expiration;

    /// <summary>
    /// Indicates whether to use sliding (true), or absolute (false)
    /// expiration.
    /// Alternatively CachePolicyFunc can be used.
    /// </summary>
    public bool Sliding;

    /// <summary>
    /// Optional function to get caching policy.
    /// Alternatively Expiration and Sliding property can be used.
    /// </summary>
    public Func<V, CacheItemPolicy> CachePolicyFunc;

    /// <summary>
    /// Optional value validator.
    /// </summary>
    public Func<V, bool> Validator;

    /// <summary>
    /// A value factory.
    /// Alternatively FactoryAsync can be used.
    /// </summary>
    public Func<K, V> Factory;

    /// <summary>
    /// Async value factory.
    /// Alternatively Factory can be used.
    /// </summary>
    public Func<K, Task<V>> FactoryAsync;

    /// <summary>
    /// A key to string converter.
    /// </summary>
    public Func<K, string> KeyFunc;

    /// <summary>
    /// Converts builder to a Cache<K, V> instance.
    /// </summary>
    /// <param name="builder">A builder to convert.</param>
    /// <returns>A Cache<K, V> instance.</returns>
    public static implicit operator Cache<K, V>(Builder builder)
    {
      return new Cache<K, V>(builder);
    }
  }

  /// <summary>
  /// Creates a cache from a cache builder.
  /// </summary>
  /// <param name="builder">A cache builder instance.</param>
  public Cache(Builder builder)
  {
    if ((builder.Factory == null) && (builder.FactoryAsync == null))
    {
      throw new ArgumentException("builder.Factory");
    }

    if (builder.MemoryCache == null)
    {
      builder.MemoryCache = MemoryCache.Default;
    }

    this.builder = builder;
  }

  /// <summary>
  /// Cached value by key.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A cached value.</returns>
  public V this[K key]
  {
    get { return Get(key); }
    set { Set(key, value); }
  }

  /// <summary>
  /// Sets a value for a key.
  /// </summary>
  /// <param name="key">A key to set.</param>
  /// <param name="value">A value to set.</param>
  public void Set(K key, V value)
  {
    SetImpl(GetKey(key), IsValid(value) ? value : null);
  }

  /// <summary>
  /// Gets a value for a key.
  /// </summary>
  /// <param name="key">A key to get value for.</param>
  /// <returns>A value instance.</returns>
  public V Get(K key)
  {
    var keyValue = GetKey(key);
    var value = builder.MemoryCache.Get(keyValue) as V;

    if (!IsValid(value))
    {
      value = CreateValue(key);
      SetImpl(keyValue, value);
    }

    return value;
  }

  /// <summary>
  /// Gets a task to return an async value.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A cached value.</returns>
  public async Task<V> GetAsync(K key)
  {
    var keyValue = GetKey(key);
    var value = builder.MemoryCache.Get(keyValue) as V;

    if (!IsValid(value))
    {
      value = await CreateValueAsync(key);
      SetImpl(keyValue, value);
    }

    return value;
  }

  /// <summary>
  /// Gets string key value for a key.
  /// </summary>
  /// <param name="key">A key.</param>
  /// <returns>A string key value.</returns>
  protected string GetKey(K key)
  {
    return builder.KeyFunc != null ? builder.KeyFunc(key) :
      key == null ? null : key.ToString();
  }

  /// <summary>
  /// Creates a value for a key.
  /// </summary>
  /// <param name="key">A key to create value for.</param>
  /// <returns>A value instance.</returns>
  protected V CreateValue(K key)
  {
    return builder.Factory != null ? builder.Factory(key) :
      builder.FactoryAsync(key).Result;
  }

  /// <summary>
  /// Creates a task for value for a key.
  /// </summary>
  /// <param name="key">A key to create value for.</param>
  /// <returns>A task for a value instance.</returns>
  protected Task<V> CreateValueAsync(K key)
  {
    return builder.FactoryAsync != null ? builder.FactoryAsync(key) :
      Task.FromResult(builder.Factory(key));
  }

  /// <summary>
  /// Validates the value.
  /// </summary>
  /// <param name="value">A value to validate.</param>
  /// <returns>
  /// true if value is valid for a cache, and false otherwise.
  /// </returns>
  protected bool IsValid(V value)
  {
    return (value != null) &&
      ((builder.Validator == null) || builder.Validator(value));
  }

  /// <summary>
  /// Set implementation.
  /// </summary>
  /// <param name="key">A key to set value for.</param>
  /// <param name="value">A value to set.</param>
  /// <returns>A set value.</returns>
  private V SetImpl(string key, V value)
  {
    if (value == null)
    {
      builder.MemoryCache.Remove(key);
    }
    else
    {
      builder.MemoryCache.Set(
        key,
        value,
        builder.CachePolicyFunc != null ? builder.CachePolicyFunc(value) :
        builder.Sliding ?
          new CacheItemPolicy { SlidingExpiration = builder.Expiration } :
          new CacheItemPolicy
          {
            AbsoluteExpiration = DateTime.Now + builder.Expiration
          });
    }

    return value;
  }

  /// <summary>
  /// Cache builder.
  /// </summary>
  private Builder builder;
}

The use consists of initialization:

Cache<MyKey, MyValue> MyValues = new Cache<MyKey, MyValue>.Builder
{
  KeyFunc = key => ...key to string value...,
  Factory = key => ...create a value for a key...,
  Expiration = new TimeSpan(0, 3, 0),
  Sliding = true
};

and a trivial cache access:

var value = MyValues[key];

This contrasts with the MemoryCache coding pattern:

MemoryCache cache = MemoryCache.Default;
...

var keyAsString = ...key to string value...
var value = cache.Get(keyAsString) as MyValue;

if (value == null)
{
  value = ...create a value for a key...
  cache.Set(keyAsString, value, ...caching policy...);
}

IEqualityComparer in LINQ

Vladimir Nesterovsky — Thu, 10 Jul 2014 20:31:42 GMT

The Enumerable class contains many overloads with an IEqualityComparer<T> argument. The most notable methods are:

Contains;
Distinct;
Except;
GroupBy;
Intersect;
Join;
ToDictionary;
ToLookup;
Union.

Recently we dealt with a simple case:

source.
  Select(
    item =>
      new Word
      {
        Text = ...,
        LangID = ...,
        Properties = ...
        ...
      }).
  Distinct(equality comparer by Text and LangID);

In other words, how do you produce an enumeration of distinct words from an enumeration of words, where two words are considered equal if their Text and LangID are equal?

It turns out it's cumbersome to implement the IEqualityComparer<T> interface (or any other interface in C#); at least it's nowhere near the conciseness of lambda functions.

Here we've decided to step into framework space and to introduce an API to define simple equality comparers for a class.

We start from the use case:

var wordComparer = KeyEqualityComparer.Null<Word>().
  ThenBy(item => item.Text).
  ThenBy(item => item.LangID);
...
source.Select(...).Distinct(wordComparer);

And then proceed to the API:

namespace NesterovskyBros.Linq
{
  using System;
  using System.Collections;
  using System.Collections.Generic;

  /// <summary>
  /// Equality comparer extensions.
  /// </summary>
  public static class KeyEqualityComparer
  {
    /// <summary>
    /// Gets null as equality comparer for a type.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <returns>
    /// null as equality comparer for a type.
    /// </returns>
    public static IEqualityComparer<T> Null<T>()
    {
      return null;
    }

    /// <summary>
    /// Creates an equality comparer for an enumeration item.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <typeparam name="K">A key type.</typeparam>
    /// <param name="source">Source items.</param>
    /// <param name="keyFunc">A key function.</param>
    /// <returns>
    /// An equality comparer that compares items by key.
    /// </returns>
    public static IEqualityComparer<T> EqualityComparerBy<T, K>(
      this IEnumerable<T> source,
      Func<T, K> keyFunc)
    {
      return new KeyEqualityComparer<T, K>(keyFunc);
    }

    /// <summary>
    /// Creates an equality comparer that uses this comparer as a base.
    /// </summary>
    /// <typeparam name="T">A type.</typeparam>
    /// <typeparam name="K">A key type.</typeparam>
    /// <param name="equalityComparer">A base equality comparer.</param>
    /// <param name="keyFunc">A key function.</param>
    /// <returns>
    /// An equality comparer that uses this comparer as a base.
    /// </returns>
    public static KeyEqualityComparer<T, K> ThenBy<T, K>(
      this IEqualityComparer<T> equalityComparer,
      Func<T, K> keyFunc)
    {
      return new KeyEqualityComparer<T, K>(keyFunc, equalityComparer);
    }
  }

  /// <summary>
  /// Equality comparer that uses a function to extract a comparison key.
  /// </summary>
  /// <typeparam name="T">A type.</typeparam>
  /// <typeparam name="K">A key type.</typeparam>
  public struct KeyEqualityComparer<T, K>: IEqualityComparer<T>
  {
    /// <summary>
    /// Creates an equality comparer.
    /// </summary>
    /// <param name="keyFunc">A key function.</param>
    /// <param name="equalityComparer">A base equality comparer.</param>
    public KeyEqualityComparer(
      Func<T, K> keyFunc,
      IEqualityComparer<T> equalityComparer = null)
    {
      KeyFunc = keyFunc;
      EqualityComparer = equalityComparer;
    }

    /// <summary>
    /// Determines whether the specified objects are equal.
    /// </summary>
    /// <param name="x">The first object of type T to compare.</param>
    /// <param name="y">The second object of type T to compare.</param>
    /// <returns>
    /// true if the specified objects are equal; otherwise, false.
    /// </returns>
    public bool Equals(T x, T y)
    {
      return ((EqualityComparer == null) || EqualityComparer.Equals(x, y)) &&
        EqualityComparer<K>.Default.Equals(KeyFunc(x), KeyFunc(y));
    }

    /// <summary>
    /// Returns a hash code for the specified object.
    /// </summary>
    /// <param name="obj">
    /// The value for which a hash code is to be returned.
    /// </param>
    /// <returns>A hash code for the specified object.</returns>
    public int GetHashCode(T obj)
    {
      var hash = EqualityComparer<K>.Default.GetHashCode(KeyFunc(obj));

      if (EqualityComparer != null)
      {
        var hash2 = EqualityComparer.GetHashCode(obj);

        hash ^= (hash2 << 5) + hash2;
      }

      return hash;
    }

    /// <summary>
    /// A key function.
    /// </summary>
    public readonly Func<T, K> KeyFunc;

    /// <summary>
    /// Optional base equality comparer.
    /// </summary>
    public readonly IEqualityComparer<T> EqualityComparer;
  }
}

So, now you can easily build simple equality comparers, to cache them or to instantiate them on the fly. These comparers are usually related to property values or to functions of source values.
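For comparison, the same Text plus LangID equality can be sketched with a single key function that returns a composite key. This is an alternative illustration, not the API above: KeyComparer and WordDemo are hypothetical names, the Word class is reduced to two properties, and the value tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reduced copy of the Word type, for illustration only.
public class Word
{
    public string Text { get; set; }
    public int LangID { get; set; }
}

// Hypothetical comparer: two items are equal when their keys are equal.
public sealed class KeyComparer<T, K> : IEqualityComparer<T>
{
    private readonly Func<T, K> keyFunc;

    public KeyComparer(Func<T, K> keyFunc)
    {
        this.keyFunc = keyFunc;
    }

    public bool Equals(T x, T y)
    {
        return EqualityComparer<K>.Default.Equals(keyFunc(x), keyFunc(y));
    }

    public int GetHashCode(T obj)
    {
        return EqualityComparer<K>.Default.GetHashCode(keyFunc(obj));
    }
}

public static class WordDemo
{
    // Distinct words, where the key is the (Text, LangID) tuple.
    public static List<Word> DistinctWords(IEnumerable<Word> source)
    {
        var comparer = new KeyComparer<Word, (string, int)>(
            word => (word.Text, word.LangID));

        return source.Distinct(comparer).ToList();
    }
}
```

The tuple version is shorter for a fixed set of keys, while the chained ThenBy form lets you extend an existing comparer step by step.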

Async and scarce resources in C#

Vladimir Nesterovsky — Sat, 28 Jun 2014 11:56:32 GMT

This is a small post about refactoring lock statements in async methods.

Before refactoring we had code like this:

lock(sync)
{
  result = methodToRefactorIntoAsync();
}
...
private object sync = new object();

A lock is bound to a thread, so there is no way to await inside it in async code. As an alternative you may use the SemaphoreSlim class:

await sync.WaitAsync(cancellationToken);

try
{
  result = await methodAsync(cancellationToken);
}
finally
{
  sync.Release();
}
...
private SemaphoreSlim sync = new SemaphoreSlim(1, 1);
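The pattern can be tried in isolation with a small sketch. Counter is a hypothetical class; the Task.Yield call only stands in for real asynchronous work inside the critical section.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example: a counter whose update spans an await.
public class Counter
{
    private readonly SemaphoreSlim sync = new SemaphoreSlim(1, 1);
    private int value;

    public async Task<int> IncrementAsync(
        CancellationToken cancellationToken = default(CancellationToken))
    {
        await sync.WaitAsync(cancellationToken);

        try
        {
            var current = value;

            // An await inside the critical section; a lock statement forbids this.
            await Task.Yield();
            value = current + 1;

            return value;
        }
        finally
        {
            sync.Release();
        }
    }

    // Runs count concurrent increments and returns the final value.
    public static async Task<int> RunAsync(int count)
    {
        var counter = new Counter();

        await Task.WhenAll(
            Enumerable.Range(0, count).Select(index => counter.IncrementAsync()));

        return counter.value;
    }
}
```

Without the semaphore the read and the write could interleave between tasks and updates would be lost. Note that, unlike lock, SemaphoreSlim is not reentrant: code that already holds the semaphore must not wait for it again.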

Client disconnects in ASP.NET Web API

Vladimir Nesterovsky — Fri, 27 Jun 2014 14:42:44 GMT

What will you do if you have an async Web API method that runs on the server for some time, but your client has dropped?

There are two solutions:

Run the method to the end and allow the framework to deal with the disconnect;
Try to be notified about the client's drop and break early.

The first approach is the simplest but might result in some overconsumption of server resources. The other method requires you to check the client status from time to time.

Fortunately, ASP.NET provides the HttpResponse.ClientDisconnectedToken property, which is limited to IIS 7.5+ in integrated mode, but still fits our needs. So, you should request ClientDisconnectedToken, if any, and implement your async code using that token.

The following extension function gets that token:

using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading;
using System.Web;

public static class HttpApiExtensions
{
  public static CancellationToken GetCancellationToken(
    this HttpRequestMessage request)
  {
    // CancellationToken.None is used when cancellation is not supported.
    var cancellationToken = CancellationToken.None;
    object value;
    var key = typeof(HttpApiExtensions).Namespace + ":CancellationToken";

    if (request.Properties.TryGetValue(key, out value))
    {
      return (CancellationToken)value;
    }

    var httpContext = HttpContext.Current;

    if (httpContext != null)
    {
      var httpResponse = httpContext.Response;

      if (httpResponse != null)
      {
        try
        {
          cancellationToken = httpResponse.ClientDisconnectedToken;
        }
        catch
        {
          // Do not support cancellation.
        }
      }
    }

    request.Properties[key] = cancellationToken;

    return cancellationToken;
  }
}

And here is a Web API WordCount service described in the previous post:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using System.Web.Http;

public class ValuesController: ApiController
{
  public async Task<int> GetWordCount([FromUri(Name = "url")] string[] urls)
  {
    var cancellationToken = Request.GetCancellationToken();

    using(var client = new HttpClient())
    {
      return (await Task.WhenAll(
        urls.Select(url => WordCountAsync(client, url, cancellationToken)))).Sum();
    }
  }

  public static async Task<int> WordCountAsync(
    HttpClient client,
    string url,
    CancellationToken cancellationToken)
  {
    string content = await (await client.GetAsync(url, cancellationToken)).
      Content.ReadAsStringAsync();

    return WordCount(content);
  }

  private static int WordCount(string text)
  {
    var count = 0;
    var space = true;

    for (var i = 0; i < text.Length; ++i)
    {
      if (space != char.IsWhiteSpace(text[i]))
      {
        space = !space;

        if (!space)
        {
          ++count;
        }
      }
    }

    return count;
  }
}

Though it is simple, there is a nuisance. You should pass the cancellation token here and there, which adds to the pollution from async.
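How a token breaks work early can be seen without a web server. In the following minimal sketch CancellationDemo and its methods are hypothetical names, and a CancellationTokenSource with a timeout plays the role of the client disconnect.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellationDemo
{
    // Hypothetical long-running operation that observes the token.
    public static async Task<int> WorkAsync(CancellationToken cancellationToken)
    {
        var steps = 0;

        while (steps < 1000)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // Task.Delay stands in for an I/O call that accepts a token.
            await Task.Delay(10, cancellationToken);
            ++steps;
        }

        return steps;
    }

    // Returns true when the work was interrupted by cancellation.
    public static async Task<bool> RunAndCancelAsync()
    {
        using (var source = new CancellationTokenSource())
        {
            var task = WorkAsync(source.Token);

            // Plays the role of a client disconnect.
            source.CancelAfter(50);

            try
            {
                await task;

                return false;
            }
            catch (OperationCanceledException)
            {
                return true;
            }
        }
    }
}
```

The token has to be threaded through every call that should stop early, which is exactly the nuisance mentioned above.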

Parallel vs Async in C#

Vladimir Nesterovsky — Wed, 25 Jun 2014 12:51:28 GMT

Though parallel and async algorithms solve different tasks, they converge in some cases. And it's not always immediately clear which is best.

Consider the following task: get the total word count contained in a given set of urls.

At first we solved it as a parallel task: indeed, this fits the MapReduce pattern, where you get urls' contents and count the number of words in parallel (Map), and then sum the word counts of all urls to get the final result (Reduce). But then we decided that the very same MapReduce algorithm can be implemented with async.

This is a parallel word count:

public static int ParallelWordCount(IEnumerable<string> urls)
{
  var result = 0;

  Parallel.ForEach(
    urls,
    url =>
    {
      string content;

      using(var client = new WebClient())
      {
        content = client.DownloadString(url);
      }

      var count = WordCount(content);

      Interlocked.Add(ref result, count);
    });

  return result;
}

Here is async word count:

public static async Task<int> WordCountAsync(IEnumerable<string> urls)
{
  return (await Task.WhenAll(urls.Select(url => WordCountAsync(url)))).Sum();
}

public static async Task<int> WordCountAsync(string url)
{
  string content;

  using(var client = new WebClient())
  {
    content = await client.DownloadStringTaskAsync(url);
  }

  return WordCount(content);
}

And this is an implementation of word count for a text (it's less important for this discussion):

public static int WordCount(string text)
{
  var count = 0;
  var space = true;

  for(var i = 0; i < text.Length; ++i)
  {
    if (space != char.IsWhiteSpace(text[i]))
    {
      space = !space;

      if (!space)
      {
        ++count;
      }
    }
  }

  return count;
}

Our impressions are:

The parallel version is contained in one method, while the async one is implemented with two methods.

The two methods are not strictly required: C# supports async lambda functions (async url => ...), so the per-url method could be inlined into the Select call. Features should be composable: if one can implement a method as a lambda function, and one can implement a method as async, then one should be able to implement a method as an async lambda function, and C# 5 allows exactly that.
Both parallel and async versions use the thread pool to run their logic.
While both implementations follow the MapReduce pattern, we can see that the async version is much more scalable. That's because parallel threads stay blocked while waiting for an http response. On the other hand, async tasks are not bound to any thread and are just not running while waiting for I/O.

This sample helped us to answer the question as to when to use parallel and when async. The simple answer goes like this:

if your logic is only CPU bound then use the parallel API;
otherwise use the async API (this accounts for I/O waits).
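The rule of thumb can be sketched with two toy workloads. All names here are hypothetical: counting primes stands in for CPU-bound logic, and Task.Delay stands in for a network wait.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelVsAsync
{
    // CPU bound: count primes below a limit using several threads.
    public static int CountPrimes(int limit)
    {
        var count = 0;

        Parallel.For(2, limit, value =>
        {
            if (IsPrime(value))
            {
                Interlocked.Increment(ref count);
            }
        });

        return count;
    }

    // I/O bound: all waits overlap and no thread is blocked while waiting.
    public static async Task<int> SumAsync(int[] values)
    {
        return (await Task.WhenAll(values.Select(value => FetchAsync(value)))).Sum();
    }

    // Task.Delay stands in for a network round trip.
    private static async Task<int> FetchAsync(int value)
    {
        await Task.Delay(20);

        return value;
    }

    private static bool IsPrime(int value)
    {
        for (var i = 2; i * i <= value; ++i)
        {
            if (value % i == 0)
            {
                return false;
            }
        }

        return value >= 2;
    }
}
```

In CountPrimes every thread does useful work all the time, so more threads help. In SumAsync almost all the time is spent waiting, so the waits overlap without tying up threads.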

Async code in C#

Vladimir Nesterovsky — Mon, 23 Jun 2014 06:15:55 GMT

Not long ago C# introduced special language constructs to simplify asynchronous programming. It seems C++1x will follow the async trend. But only recently, when frameworks like ASP.NET Web API and Entity Framework started to catch up, have we felt what it's like to program with the async and await keywords.

At first glance it seems it's a pure pleasure to write async methods:

private async Task SumPageSizesAsync()
{
  // To use the HttpClient type in desktop apps, you must include a using directive and add a
  // reference for the System.Net.Http namespace.
  HttpClient client = new HttpClient();
  // . . .
  byte[] urlContents = await client.GetByteArrayAsync(url);
  // . . .
}

To dereference a Task<T> into T you just write await task_t_expression, mark your method with the async specifier, and adjust the output type (if not void) to Task or Task<Result>. The compiler applies its magic to convert your code into an asynchronous state machine.

We liked this feature and immediately started to use it. But, as we said, async/await started to shine in full only when frameworks made it a core element, and at that point we started to see that while async/await solve the task, they do not shield the developer from implementation details, as the code gets considerably polluted.

Consider a method where the pollution comes from the async and await keywords, the Task<...> wrapper, and the ...Async method variants:

public static async Task<UserAuthorization> GetAuthorizationAsync(string accessToken)
{
  var key = "oauth2:" + accessToken;
  var authorization = cache.Get<UserAuthorization>(key);

  if (authorization != null)
  {
    return authorization;
  }

  using(var model = new ModelContainer())
  {
    authorization =
      (await model.UserAuthorizations.
        Where(item => item.AccessToken == accessToken).
        ToListAsync()).
      FirstOrDefault();
  }

  if (authorization == null)
  {
    authorization = await ValidateAsync(accessToken);
  }

  cache.Set(key, cache.ShortDelay, authorization);

  return authorization;
}

The more you use async, the more pollution you will see in your code. Ideally we would like to see the quoted method without any of those parts.

AngularJS

Vladimir Nesterovsky — Sun, 25 May 2014 08:02:41 GMT

After several years of experience with KendoUI we turned our attention to AngularJS. Like many other libraries it has its strong and weak sides. There are many resources describing what AngularJS is, and what it is not. Our approach to studying AngularJS was through an attempt to integrate it into an existing KendoUI web application.

It's rather straightforward to convert a model from KendoUI into AngularJS, as logically both frameworks are equal in this regard. But tactically KendoUI implements model-view binding very differently than AngularJS does. KendoUI binds model to view immediately, per each model field, whereas AngularJS delays the binding of each model field and performs the whole model binding in one go. Angular's approach is more performant, and even more appealing to a developer, but the problem is that the time it takes to bind the whole model is proportional to the size (number of objects and properties) of the model. This means that if you have a relatively big model you will experience tangible halts in the browser's UI while the javascript that updates view/model is running.

AngularJS advises some workarounds, which in essence are to avoid a big model. The problem is that a couple of thousand or even several hundred objects and properties are already considered a big model. So, you should plan your model and view from the start to avoid any potential impact. This seriously distracts from the task you're solving.

The idea that your UI will halt for a time proportional to the size of your whole model looks flawed in our opinion. KendoUI knows no such problem. That's the reason why our KendoUI to AngularJS transition experience was not smooth.

Our analysis of AngularJS sources shows that the issue could be resolved provided model to view binding (it's called digest in that library) was asynchronous.

To verify our ideas we have created a branch nesterovsky-bros/angular.js where we implemented the required refactorings. It includes:

an API based on the existing deferred/promise to write algorithms in an async way, and
refactored digest logic.

At the end we have proposed to integrate our changes into the main branch: Make $digest async.

We're not sure whether our proposal will be integrated (rather no than yes). Nevertheless what we have come up with is an interesting extension of the deferred object that we have seen neither in AngularJS nor in JQuery, so later we will quote that API from q.js and scheduler.js.

Dealing with dynamic SQL in SQL Server

Vladimir Nesterovsky — Tue, 11 Feb 2014 09:48:07 GMT

These are the initial positions for this writing:

SQL Server allows you to execute dynamic SQL.
Dynamic SQL is useful and often unavoidable, e.g. when you have to filter or order data in a way that you cannot code efficiently in advance.
Dynamic SQL has proven to be a dangerous area, as improper use can open a security hole.

In general nothing stops you from building and then executing an SQL string. Our goal, however, is to define rules that make work with dynamic SQL more manageable and verifiable.

Here we outline these rules, and then give some examples and tips.

Rule #1. Isolate dynamic SQL

Put all logic related to building of dynamic SQL into a separate function.
We usually define a separate schema Dynamic, and define functions like Dynamic.GetSQL_XXX(params).
This makes it simple to perform code review.

Rule #2. Xml as parameters

Use xml type to pass parameters to a function that builds dynamic SQL.
In many cases dynamic SQL depends on variable number of parameters (like a list of values to check against).
Xml fits here to represent structured information.
On a client (e.g. in C# or java) you can define a class with all parameters, populate an instance and serialize it to an xml.

Rule #3. XQuery as template language

Use XQuery to define SQL template and to generate SQL tree from the input parameters.
Here is an example of such XQuery:

@data.query('
  <sql>
    select
      T.*
    from
      Data.Ticket T
    where
    {
      for $ticketID in data/ticketID return
        <sql>(T.TicketID = <int>{$ticketID}</int>) and </sql>
    }
    (1 = 1)
  </sql>')

You can see that the output is an xml with the sql element to represent literal SQL, and the int element to represent an integer literal.

In fact the whole output schema can be defined like this:

<xs:schema elementFormDefault="qualified"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="sql"/>
  <xs:element name="name"/>
  <xs:element name="string" nillable="true"/>
  <xs:element name="int" nillable="true"/>
  <xs:element name="decimal" nillable="true"/>
  <xs:element name="date" nillable="true"/>
  <xs:element name="time" nillable="true"/>
  <xs:element name="datetime" nillable="true"/>
</xs:schema>

where sql is to represent literal content, name to represent a name, and the other elements to represent different literal values.

Rule #4. Escape literals

Use function Dynamic.ToSQL(@template) to build final SQL text.
Here we quote the definition:

-- Builds a text of SQL function for an sql template.
create function Dynamic.ToSQL
(
  -- SQL template.
  @template xml
)
returns nvarchar(max)
with returns null on null input
as
begin
  return
  (
    select
      case
        when N.Node.exist('*[xs:boolean(@xsi:nil)]') = 1 then
          'null'
        when N.Node.exist('self::int') = 1 then
          isnull(N.Node.value('xs:int(.)', 'nvarchar(max)'), '# int #')
        when N.Node.exist('self::string') = 1 then
          'N''' +
            replace
            (
              N.Node.value('.', 'nvarchar(max)'),
              '''',
              ''''''
            ) +
            ''''
        when N.Node.exist('self::name') = 1 then
          isnull
          (
            quotename(N.Node.value('.', 'nvarchar(128)'), '['),
            '# name #'
          )
        when N.Node.exist('self::datetime') = 1 then
          isnull
          (
            'convert(datetime2, ''' +
              N.Node.value('xs:dateTime(.)', 'nvarchar(128)') +
              ''', 126)',
            '# datetime #'
          )
        when N.Node.exist('self::date') = 1 then
          isnull
          (
            'convert(date, ''' +
              N.Node.value('xs:date(.)', 'nvarchar(128)') +
              ''', 126)',
            '# date #'
          )
        when N.Node.exist('self::time') = 1 then
          isnull
          (
            'convert(time, ''' +
              N.Node.value('xs:time(.)', 'nvarchar(128)') +
              ''', 114)',
            '# time #'
          )
        when N.Node.exist('self::decimal') = 1 then
          isnull
          (
            N.Node.value('xs:decimal(.)', 'nvarchar(128)'),
            '# decimal #'
          )
        when N.Node.exist('self::*') = 1 then
          '# invalid template #'
        else
          N.Node.value('.', 'nvarchar(max)')
      end
    from
      @template.nodes('//sql/node()[not(self::sql)]') N(Node)
    for xml path(''), type
  ).value('.', 'nvarchar(max)');
end;

Now, we want to stress that this function plays an important role in preventing SQL injection, as it escapes literals from the SQL tree.
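To make the escaping concrete, here is a minimal C# sketch of what the string and name branches do. SqlLiteral is a hypothetical helper written for illustration only; it does not replace Dynamic.ToSQL and covers just these two cases.

```csharp
using System;

// Illustration only: C# counterparts of the string and name branches.
public static class SqlLiteral
{
    // Wraps a value into N'...' and doubles embedded single quotes,
    // so the value cannot terminate the literal.
    public static string Quote(string value)
    {
        return value == null ? "null" : "N'" + value.Replace("'", "''") + "'";
    }

    // Mirrors quotename(value, '['): wraps a name into brackets
    // and doubles embedded closing brackets.
    public static string QuoteName(string name)
    {
        if (name == null)
        {
            throw new ArgumentNullException("name");
        }

        return "[" + name.Replace("]", "]]") + "]";
    }
}
```

For example, Quote applied to O'Brien gives N'O''Brien', and an injection attempt like x'; drop table T; -- stays inside a single string literal.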

Rule #5 (optional). Collect data

Use SQL to collect additional data required to build dynamic SQL. Here is an example of how we get a Ticket by StatusID, while on input we receive a StatusName:

create function Dynamic.GetSQL_GetTicketByStatus(@data xml)
returns nvarchar(max)
as
begin
  set @data =
  (
    select
      @data,
      (
        select
          T.StatusID
        from
          @data.nodes('/data/status') N(Node)
          inner join
          Metadata.Status T
          on
            T.StatusName = Node.value('.', 'nvarchar(128)')
        for xml auto, type, elements
      )
    for xml path('')
  );

  return Dynamic.ToSQL
  (
    @data.query
    ('
      <sql>
        select
          T.*
        from
          Data.Ticket T
        where
          T.Status in
          ({
            for $status in /T/StatusID return
              <sql><int>{$status}</int>,</sql>
          } null)
      </sql>
    ')
  );
end;

Notice the set @data = (...) statement, which collects some more data before calling XQuery.

Rule #6. Execute

The final step is to call dynamic SQL.
This is done like this:

-- build
declare @sql nvarchar(max) = Dynamic.GetSQL_GetTicket(@data);

-- execute
execute sp_executesql
  @sql
  -- {, N'@parameter_name data_type [ OUT | OUTPUT ][ ,...n ]' }
  -- { , [ @param1 = ] 'value1' [ ,...n ] }
  with result sets
  (
    (
      TicketID int not null,
      CreatedAt datetime2 not null,
      Summary nvarchar(256) null,
      Status int,
      Severity int,
      DeadLineAt datetime2 null
    )
  );

Notice that the use of dynamic SQL does not prevent static parameters.
Notice also that the with result sets clause is used to specify the output.

Example. Tickets system

Let's assume you're dealing with a tickets system (like Bugzilla), and you have a table Data.Ticket to describe tickets. Assume that the DDL for this table is like this:

create table Data.Ticket
(
  TicketID bigint not null primary key,
  CreatedAt datetime2 not null,
  Summary nvarchar(128) null,
  Status int not null,
  UpdatedAt datetime2(7) not null
)

Suppose you have to build C# code to search different tickets, where Entity Framework is used to access the database.
Search should be done by a range of CreatedAt, a range of UpdatedAt, Summary, or by different Status values. It should be possible to order results in different ways.

We start our solution from C# and define classes for a request:

public enum Direction
{
  Asc,
  Desc
}

public struct Order
{
  public string Field { get; set; }
  public Direction Direction { get; set; }
}

public class DateRange
{
  public DateTime? From { get; set; }

  // This property is to omit From element if value is null.
  // See rules for xml serialization.
  public bool FromSpecified
  {
    get { return From != null; }
  }

  public DateTime? To { get; set; }

  public bool ToSpecified
  {
    get { return To != null; }
  }
}

public class TicketsRequest
{
  public DateRange CreatedAt { get; set; }
  public string Summary { get; set; }
  public DateRange UpdatedAt { get; set; }
  [XmlElement]
  public Order[] Order { get; set; }
  [XmlElement]
  public int[] Status { get; set; }
}

Notice that we're going to use XmlSerializer to convert the request to xml and then pass it as a parameter into EF's model. Here is a utility method to perform such a conversion:

public static string ToXmlString<T>(T value)
{
  if (value == null)
  {
    return null;
  }

  var serializer = new XmlSerializer(typeof(T));
  var builder = new StringBuilder();

  var writer = XmlWriter.Create(
    builder,
    new XmlWriterSettings
    {
      OmitXmlDeclaration = true,
      Indent = false
    });

  serializer.Serialize(writer, value);
  writer.Flush();

  return builder.ToString();
}
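The effect of the FromSpecified and ToSpecified properties can be checked with a small self-contained sketch. It uses a reduced copy of DateRange; XmlDemo is a hypothetical name and its ToXml method is a shortened variant of the utility above.

```csharp
using System;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Reduced copy of the request class, for illustration only.
public class DateRange
{
    public DateTime? From { get; set; }

    // XmlSerializer consults XxxSpecified to decide whether to emit Xxx.
    public bool FromSpecified
    {
        get { return From != null; }
    }

    public DateTime? To { get; set; }

    public bool ToSpecified
    {
        get { return To != null; }
    }
}

public static class XmlDemo
{
    // Serializes a value to an xml string without the xml declaration.
    public static string ToXml<T>(T value)
    {
        var builder = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (var writer = XmlWriter.Create(builder, settings))
        {
            new XmlSerializer(typeof(T)).Serialize(writer, value);
        }

        return builder.ToString();
    }
}
```

Serializing new DateRange { From = new DateTime(2014, 1, 1) } should give xml with a From element and no To element at all. Without the Specified properties a null value would be written as an element marked with xsi:nil="true".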

Now we proceed to the database and define a procedure that runs the search:

-- Gets tickets.
create procedure Data.GetTickets
(
  -- A query parameters.
  @params xml
)
as
begin
  set nocount on;

  -- This is for EF to guess type of result.
  if (1 = 0)
  begin
    select
      TicketID,
      CreatedAt,
      Summary,
      Status,
      UpdatedAt
    from
      Data.Ticket;
  end;

  declare @sql nvarchar(max) = Dynamic.GetSQL_GetTickets(@params);

  execute sp_executesql @sql
    with result sets
    (
      (
        TicketID bigint not null,
        CreatedAt datetime2 not null,
        Summary nvarchar(256) null,
        Status int,
        UpdatedAt datetime2 null
      )
    );
end;

Switch back to C#, import the Data.GetTickets into the EF model, and create a search method:

public IEnumerable<Ticket> GetTickets(TicketsRequest request)
{
  var model = new Model();

  return model.GetTickets(ToXmlString(request));
}

The last ingredient is the Dynamic.GetSQL_GetTickets() function.

create function Dynamic.GetSQL_GetTickets(@data xml)
returns nvarchar(max)
as
begin
  return Dynamic.ToSQL
  (
    @data.query('
      <sql>
        select
          T.TicketID,
          T.CreatedAt,
          T.Summary,
          T.Status,
          T.UpdatedAt
        from
          Data.Ticket T
        where
        {
          for $range in */CreatedAt return
          (
            for $date in $range/From return
              <sql>
                (T.CreatedAt >= <datetime>{$date}</datetime>) and
              </sql>,

            for $date in $range/To return
              <sql>
                (<datetime>{$date}</datetime> > T.CreatedAt) and
              </sql>
          ),

          for $range in */UpdatedAt return
          (
            for $date in $range/From return
              <sql>
                (T.UpdatedAt >= <datetime>{$date}</datetime>) and
              </sql>,

            for $date in $range/To return
              <sql>
                (<datetime>{$date}</datetime> > T.UpdatedAt) and
              </sql>
          ),

          for $summary in */Summary return
            <sql>
              (T.Summary like <string>{$summary}</string>) and
            </sql>,

          if (*/Status) then
            <sql>
              T.Status in
              ({
                for $status in */Status return
                  <sql><int>{$status}</int>, </sql>
              } null) and
            </sql>
          else
            ()
        }
        (1 = 1)
        order by
        {
          for $order in
            */Order
            [
              Field = ("TicketID", "CreatedAt", "Summary", "UpdatedAt", "Status")
            ]
          return
            <sql>
              <name>{$order/Field}</name>
              {" desc"[$order[Direction = "Desc"]]},
            </sql>
        }
        (select null)
      </sql>
    ')
  );
end;

SQL text from Dynamic.GetSQL_GetTickets()

Consider now the SQL text produced by this function. For an input:

<TicketsRequest>
  <CreatedAt>
    <From>2014-01-01T00:00:00</From>
  </CreatedAt>
  <Summary>hello%</Summary>
  <Order>
    <Field>Status</Field>
    <Direction>Desc</Direction>
  </Order>
  <Status>1</Status>
  <Status>3</Status>
</TicketsRequest>

the output is:

select T.TicketID, T.CreatedAt, T.Summary, T.Status, T.UpdatedAt from Data.Ticket T where (T.CreatedAt >= convert(datetime2, '2014-01-01T00:00:00', 126)) and (T.Summary like N'hello%') and T.Status in (1, 3, null) and (1 = 1) order by [Status] desc, (select null)

Though the text is not formatted as we would like, it's perfectly valid SQL.

Tips for building XQuery templates

What is called XQuery in SQL Server is in fact a very limited subset of XQuery 1.0. Microsoft clearly states this fact. What is trivial in XQuery is often impossible or ugly in XQuery of SQL Server.

Nevertheless, XQuery in SQL Server works rather well as an SQL template language. To use it efficiently, however, you should learn several tips.

Tip #1. Where clause

In a template you might want to build a where clause:

<sql> select ... where { if (...) then <sql>...</sql> else () } </sql>

and it might happen that for a certain input every condition under where collapses, and you are left with a where keyword without a real condition, which is invalid SQL. A simple workaround is to always add some true condition under the where, like this:

<sql> select ... where { if (...) then <sql>... and </sql> else () } (1 = 1) </sql>

Tip #2. "in" expression

If you want to generate "in" expression like this:

value in (item1, item2,...)

then you might find that it's much easier to generate equivalent code like this:

value in (item1, item2,..., null).

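The trailing null never matches anything, since a comparison with null is never true, so the result of the "in" test is the same. Here is an XQuery to generate such a template: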

value in ({ for $item in ... return <sql><int>{$item}</int>, </sql> } null) and

Tip #3. Order by

You can conclude an order by clause built from input data with a dummy expression like this:

order by { for $item in ... return <sql> <name>{$item/Field}</name> {" desc"[$item/Direction = "Desc"]}, </sql> } (select null)

Alternatively you can use the first column of a clustered index.
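The terminator trick of the first three tips is not specific to XQuery. Below is a minimal Python sketch of the same idea; the function name build_query and its parameters are hypothetical and exist only for illustration. Every optional fragment ends with its own separator ("and" or a comma), and a fixed terminator ("(1 = 1)", "null", "(select null)") closes the clause, so no fragment needs to know whether it is the last one.

```python
# Hypothetical helper: mirrors the terminators used in the XQuery template.
ALLOWED_FIELDS = {"TicketID", "CreatedAt", "Summary", "UpdatedAt", "Status"}


def build_query(statuses, order):
    """statuses: iterable of ints; order: list of (field, descending) pairs."""
    parts = ["select T.TicketID from Data.Ticket T where"]

    if statuses:
        # Each item carries its own trailing comma; "null" terminates the list.
        items = "".join(f"{int(status)}, " for status in statuses)
        parts.append(f"T.Status in ({items}null) and")

    # Always-true condition that terminates the where clause.
    parts.append("(1 = 1)")
    parts.append("order by")

    for field, descending in order:
        # Field names are accepted only from a fixed list.
        if field in ALLOWED_FIELDS:
            parts.append(f"[{field}]{' desc' if descending else ''},")

    # Dummy expression that terminates the order by clause.
    parts.append("(select null)")

    return " ".join(parts)
```

With an empty request the clauses still close correctly, because only the terminators remain.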

Tip #4. Group by

In a group by clause we cannot introduce a terminator expression as we did with order by, so the code is a little less trivial:

{ let $items := ... return if ($items) then <sql> group by <name>{$items[1]}</name> { for $item in $items[position() > 1] return <sql>, <name>{$item}</name></sql> } </sql> else () }

In fact similar logic may work with order by.
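For comparison, here is a small Python sketch of the same "first item, then comma-prefixed rest" logic. The helper name group_by is hypothetical. In Python the special handling of the first item is hidden inside str.join, and the clause is omitted entirely when there are no items, just as in the XQuery above.

```python
def group_by(names):
    """Builds a group by clause, or an empty string when there is nothing to group by."""
    if not names:
        return ""

    # Quote each name with brackets, doubling any closing bracket inside it.
    quoted = ["[" + name.replace("]", "]]") + "]" for name in names]

    return "group by " + ", ".join(quoted)
```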

Tip #5. Escape literals

It's crucial not to introduce SQL injection while building SQL. Thus use:

<int>{...}</int> - for literal int;
<decimal>{...}</decimal> - for literal decimal;
<string>{...}</string> - for literal string;
<datetime>{...}</datetime> - for literal datetime2;
<date>{...}</date> - for literal date;
<time>{...}</time> - for literal time;
<name>{...}</name> - for a name to quote.

Note that you can use xsi:nil, so <int xsi:nil="true"/> means null.

If you generate a field name from input data, then it is worth validating it against a list of available names.
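What the literal elements protect against can be sketched in a few lines of Python. These helpers (sql_int, sql_string, sql_name) are hypothetical illustrations of the idea, not the implementation behind Dynamic.ToSQL: numbers are parsed before being emitted, quotes inside strings are doubled, and names are checked against a list of known columns.

```python
def sql_int(value):
    """Emits an int literal; raises ValueError for anything that is not an int."""
    return str(int(value))


def sql_string(value):
    """Emits a Unicode string literal with embedded single quotes doubled."""
    return "N'" + value.replace("'", "''") + "'"


def sql_name(value, allowed):
    """Emits a quoted name, but only if it is one of the allowed names."""
    if value not in allowed:
        raise ValueError(f"Unknown name: {value!r}")

    return "[" + value.replace("]", "]]") + "]"


def is_rejected(func, *args):
    """Returns True when func(*args) raises ValueError."""
    try:
        func(*args)
    except ValueError:
        return True

    return False
```

An attempt to smuggle SQL through a numeric or a name parameter fails before any SQL text is produced, and a quote inside a string stays inside the literal.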

Tip #6. Validate input.

It is worth defining an XML schema for the input xml, and validating parameters against it.
This makes the code more secure, and also serves as documentation.

Tip #7. Don't abuse dynamic SQL

There are not too many cases when you need dynamic SQL. Usually the SQL engine knows how to build a good execution plan. If your query contains optional conditions then you can write it in a way that SQL Server can optimize, e.g.:

select * from T where ((@name is null) or (Name = @name)) and ((@date is null) or (Date = @date)) option(recompile)

Shared state

Vladimir Nesterovsky — Wed, 22 Jan 2014 19:43:25 GMT

Consider how you would implement the Style object in the HTML DOM.

These are some characteristics of that object:

It has a long list of properties, e.g. in IE 11 there are more than 300 properties on a style object.
Any specific instance usually has only several properties assigned.
Reads of properties are much more frequent than writes. In fact style often stays unchanged after initialization.
DOM contains many style instances (often thousands).
The number of distinct instances in terms of values of properties is moderate (usually dozens).

Here is how we would approach such an object.

1. Styles are sparse objects, thus there is no point in implementing a plain class with all those properties, as it's wasteful.

We would rather use two techniques to keep style's state:

A dictionary of properties with their values;
An aggregation of objects, where all properties are grouped into families, each group is defined by a separate type, and a style's state is an aggregation of those groups.

A current style of an element is an aggregation of styles of ancestor elements. It can either be dynamic or be fused into a single style instance.

2. Make style's state immutable, and share all these states among all style instances.

In this implementation a property write turns into a state transition operation: state = set(state, property, value). Thus no state is modified; it is replaced with another state that corresponds to the required change.

If state is seen as a dictionary then the API may look like this:

public class State<K, V>
{
  // Gets a shared dictionary for an input dictionary.
  public IDictionary<K, V> Get(IDictionary<K, V> dictionary);

  // Gets a shared dictionary for an input dictionary with key set to a value.
  public IDictionary<K, V> Set(IDictionary<K, V> dictionary, K key, V value);

  // Gets a shared dictionary for an input dictionary with key removed.
  public IDictionary<K, V> Remove(IDictionary<K, V> dictionary, K key);

  // Gets typed value.
  public T Get<T>(IDictionary<K, V> dictionary, K key)
    where T: V
  {
    V value;

    if ((dictionary == null) || !dictionary.TryGetValue(key, out value))
    {
      return default(T);
    }

    return (T)value;
  }

  // Sets or removes a typed value.
  // dictionary can be null.
  // null returned if output dictionary would be empty.
  public IDictionary<K, V> Set<T>(IDictionary<K, V> dictionary, K key, T value)
    where T : V
  {
    return value == null ?
      Remove(dictionary, key) :
      Set(dictionary, key, (V)value);
  }
}

States can be cached. Provided the cache keeps states in a weak way, no unused state will be stored for a long time. We may use a weak table of dictionary to dictionary, WeakTable<Dictionary<K, V>, Dictionary<K, V>>, as a storage for such a cache. All required API is described in the WeakTable and Hash Code of Dictionary posts.
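The state transition and the weak cache can be tried out in a few lines of Python. This is only a sketch under stated assumptions: the names State and States are hypothetical, property values are assumed to be hashable, and Python's weakref.WeakValueDictionary stands in for the WeakTable mentioned above. It is not a port of State.cs.

```python
import weakref
from types import MappingProxyType


class State:
    """An immutable set of property values that can be shared by many owners."""
    __slots__ = ("values", "__weakref__")

    def __init__(self, values):
        # A read-only view over a private copy of the values.
        self.values = MappingProxyType(dict(values))


class States:
    """A cache that returns one shared State per distinct set of values."""

    def __init__(self):
        # Values are held weakly: a State disappears once nobody uses it.
        self._cache = weakref.WeakValueDictionary()

    def get(self, values):
        """Returns the shared State for a mapping, or None if it is empty."""
        if not values:
            return None

        key = frozenset(values.items())
        state = self._cache.get(key)

        if state is None:
            state = State(values)
            self._cache[key] = state

        return state

    def set(self, state, name, value):
        """State transition: returns the shared State with name set to value.

        A value of None removes the property.
        """
        values = dict(state.values) if state is not None else {}

        if value is None:
            values.pop(name, None)
        else:
            values[name] = value

        return self.get(values)
```

Two styles that arrive at the same property values through different sequences of writes end up pointing at the very same State object, and the default state is None, as in the C# sketch.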

3. Style can be implemented as a structure with shared state as a storage. Here is a sketch:

[Serializable]
public struct Style
{
  // All properties.
  public enum Property
  {
    Background,
    BorderColor,
    BorderStyle,
    Color,
    FontFamily,
    FontSize,
    // ...
  }

  public int? Background
  {
    get { return states.Get<int?>(state, Property.Background); }
    set { state = states.Set(state, Property.Background, value); }
  }

  public int? BorderColor
  {
    get { return states.Get<int?>(state, Property.BorderColor); }
    set { state = states.Set(state, Property.BorderColor, value); }
  }

  public string BorderStyle
  {
    get { return states.Get<string>(state, Property.BorderStyle); }
    set { state = states.Set(state, Property.BorderStyle, value); }
  }

  public int? Color
  {
    get { return states.Get<int?>(state, Property.Color); }
    set { state = states.Set(state, Property.Color, value); }
  }

  public string FontFamily
  {
    get { return states.Get<string>(state, Property.FontFamily); }
    set { state = states.Set(state, Property.FontFamily, value); }
  }

  public double? FontSize
  {
    get { return states.Get<double?>(state, Property.FontSize); }
    set { state = states.Set(state, Property.FontSize, value); }
  }

  // ...

  [OnDeserialized]
  private void OnDeserialized(StreamingContext context)
  {
    state = states.Get(state);
  }

  // A state.
  private IDictionary<Property, object> state;

  // A states cache.
  private static readonly State<Property, object> states =
    new State<Property, object>();
}

Note that:

default state is a null dictionary;
states are shared application-wide.

The following link is our implementation of State<K, V> class: State.cs.

Here we have outlined the idea of a shared state object, and how it can be applied to sparse, mostly immutable objects. We used HTML style as an example of such an object. A shared state object may work in many other areas, but for it to shine its use case should fit the task.

Hash Code of Dictionary

Vladimir Nesterovsky — Mon, 13 Jan 2014 20:33:31 GMT

Dealing recently with some task (the same that inspired us to implement WeakTable), we were in a position to use a dictionary as a key in another dictionary.

What are the rules for a class to be used as a key?

key should be immutable;
key should implement a GetHashCode() method;
key should implement an Equals() method.

The first requirement is usually implemented as a documentation contract like this:

As long as an object is used as a key in the Dictionary<TKey, TValue>, it must not change in any way that affects its hash value.

The third requirement about equals is trivially implemented as a method:

public bool Equals(IDictionary<K, V> x, IDictionary<K, V> y)
{
  if (x == y)
  {
    return true;
  }

  if ((x == null) || (y == null) || (x.Count != y.Count))
  {
    return false;
  }

  foreach(var entry in x)
  {
    V value;

    if (!y.TryGetValue(entry.Key, out value) ||
      !valueComparer.Equals(entry.Value, value))
    {
      return false;
    }
  }

  return true;
}

But how would you implement hash code?

We argued like this.

1. Let's consider the dictionary as a sparse array of values with only populated items that correspond to key hash codes.

2. Hash code is constructed using some fair algorithm, e.g. like the one used in Java to calculate a string's hash code:

$$h(s) = \sum_{i=0}^{n-1} s[i] \cdot p^{\,n-1-i} \bmod m, \quad \text{where } m = 2^{31}$$

In our case:

n can be an arbitrarily large int value, so in fact it's 2^32;
items are enumerated in unknown order;
there is only a limited set of items, so most s[i] are zeros.

As a result we cannot use a recurrent function to calculate a power p^k mod m. Fortunately one can build fast exponentiation arguing like this. Split k into digits k[i] of s bits each, where s is some int: 1, 2, 4, 8, 16, or 32. Then

$$p^k = p^{\sum_{i=0}^{32/s-1} 2^{s \cdot i} \cdot k[i]} \bmod m$$

Thus

$$p^k = \prod_{i=0}^{32/s-1} \left(p^{2^{s \cdot i}}\right)^{k[i]} \bmod m$$

If s = 1 then k[i] is either 1 or 0 (a bit), and there are 32 different p^(2^i) mod m values, which can be precalculated.

On the other hand, if we select s = 8 we can write the formula as:

p^k = p^k[0] * (p^(2^8))^k[1] * (p^(2^16))^k[2] * (p^(2^24))^k[3] mod m

where k[i] is an 8-bit value (byte).

Precalculating all values p^n, (p^(2^8))^n, (p^(2^16))^n, (p^(2^24))^n for n in 0 to 255 we reach a formula with four table lookups, three multiplications, and 1024 precalculated values.

Here is the whole utility to calculate hash factors:

/// <summary>
/// Hash utilities.
/// </summary>
public class Hash
{
  /// <summary>
  /// Returns a P^value mod 2^31, where P is hash base.
  /// </summary>
  /// <param name="value">A value to get hash factor for.</param>
  /// <returns>A hash factor value.</returns>
  public static int GetHashFactor(int value)
  {
    return factors[(uint)value & 0xff] *
      factors[(((uint)value >> 8) & 0xff) | 0x100] *
      factors[(((uint)value >> 16) & 0xff) | 0x200] *
      factors[(((uint)value >> 24) & 0xff) | 0x300];
  }

  /// <summary>
  /// Initializes hash factors.
  /// </summary>
  static Hash()
  {
    var values = new int[4 * 256];
    var value = P;
    var current = 1;
    var i = 0;

    // P^n for n in 0..255.
    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 256);

    // (P^(2^8))^n for n in 0..255.
    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 512);

    // (P^(2^16))^n for n in 0..255.
    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 768);

    // (P^(2^24))^n for n in 0..255.
    value = current;
    current = 1;

    do
    {
      values[i++] = current;
      current *= value;
    }
    while(i < 1024);

    factors = values;
  }

  /// <summary>
  /// A base to calculate hash factors.
  /// </summary>
  public const int P = 1103515245;

  /// <summary>
  /// Hash factors.
  /// </summary>
  private static readonly int[] factors;
}

With this API, the hash code for a dictionary is a trivial operation:

public int GetHashCode(IDictionary<K, V> dictionary)
{
  if (dictionary == null)
  {
    return 0;
  }

  var result = 0;

  foreach(var entry in dictionary)
  {
    if ((entry.Key == null) || (entry.Value == null))
    {
      continue;
    }

    result +=
      Hash.GetHashFactor(keyComparer.GetHashCode(entry.Key)) *
      valueComparer.GetHashCode(entry.Value);
  }

  return result;
}
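The table-driven power and the order-independent sum are easy to check in Python, where big integers make an independent reference computation trivial. The sketch below is an illustration rather than a port: the names hash_factor and dictionary_hash are hypothetical, results are kept as unsigned 32-bit values by masking (which emulates the wraparound of C# int multiplication), and Python's built-in hash() stands in for the key and value comparers.

```python
MASK = 0xFFFFFFFF  # keeps every intermediate result within 32 bits
P = 1103515245     # the same hash base as in the C# utility


def build_factors():
    """Builds 4 * 256 factors: (P^(2^(8*i)))^n for i in 0..3 and n in 0..255."""
    factors = []
    base = P

    for _ in range(4):
        current = 1

        for _ in range(256):
            factors.append(current)
            current = (current * base) & MASK

        # After 256 steps current == base^256, the base of the next block.
        base = current

    return factors


FACTORS = build_factors()


def hash_factor(value):
    """Returns P^value with 32-bit wraparound; value is treated as unsigned."""
    value &= MASK
    result = 1

    for i in range(4):
        digit = (value >> (8 * i)) & 0xFF
        result = (result * FACTORS[digit | (i << 8)]) & MASK

    return result


def dictionary_hash(dictionary):
    """Sums factor(hash(key)) * hash(value); the sum does not depend on order."""
    result = 0

    for key, value in dictionary.items():
        if key is None or value is None:
            continue

        result = (result + hash_factor(hash(key)) * (hash(value) & MASK)) & MASK

    return result
```

The built-in pow(P, k, 2**32) gives the same values as hash_factor(k), which confirms that the four table lookups really compute the power. Since the per-entry terms are simply added, enumeration order does not matter.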

And finally, here is a reference to a class DictionaryEqualityComparer<K, V>: IEqualityComparer<IDictionary<K, V>> that allows a dictionary to be a key in another dictionary.

Update

We have committed some tests, and have found that with a sufficiently "good" implementation of GetHashCode() of key or value, a much simpler and more straightforward algorithm achieves results of almost the same quality as the algorithm we have outlined above:

public int GetHashCode(IDictionary<K, V> dictionary)
{
  if (dictionary == null)
  {
    return 0;
  }

  var result = 0;

  foreach(var entry in dictionary)
  {
    if ((entry.Key == null) || (entry.Value == null))
    {
      continue;
    }

    var k = entry.Key.GetHashCode();
    var v = entry.Value.GetHashCode();

    k = (k << 5) + k;
    v = (v << (k >> 3)) + v;

    result += k ^ v;

    //result +=
    //  Hash.GetHashFactor(keyComparer.GetHashCode(entry.Key)) *
    //  valueComparer.GetHashCode(entry.Value);
  }

  return result;
}

It was worth blogging about this just to find out that we have outwitted ourselves, and finally to arrive at a trivial hash code implementation for the dictionary.
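The simpler variant can be sketched in Python as well. The names mix and simple_dictionary_hash are hypothetical, Python's hash() replaces GetHashCode(), and the result is kept as an unsigned 32-bit number. The shift count is masked to five bits, which is what C# does for shifts of int values. The point of the sketch is the shape of the algorithm: each entry is mixed on its own, and the mixes are added, so the order of entries is irrelevant while swapping a key with its value still changes the result.

```python
MASK = 0xFFFFFFFF  # emulates 32-bit wraparound


def mix(k, v):
    """Mixes hash codes of a key and of a value into a single 32-bit number."""
    k &= MASK
    v &= MASK

    k = ((k << 5) + k) & MASK
    # Only the low five bits of the shift count are used, as for C# int shifts.
    v = ((v << ((k >> 3) & 31)) + v) & MASK

    return k ^ v


def simple_dictionary_hash(dictionary):
    """Adds up per-entry mixes; addition makes the result order-independent."""
    result = 0

    for key, value in dictionary.items():
        if key is None or value is None:
            continue

        result = (result + mix(hash(key), hash(value))) & MASK

    return result
```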

WeakTable

Arthur Nesterovsky — Wed, 08 Jan 2014 21:57:16 GMT

Dealing recently with some task, we were in a position to use a weak dictionary in .NET. Instinctively we assumed that it should exist somewhere in the standard library. We definitely knew that there is a WeakReference class for a single instance. We also knew that there is WeakHashMap in Java, and that it's based on Java's WeakReference.

So, we were surprised to find that there is no such thing out of the box in .NET.

We have found that Java's and .NET's weak references are different. In Java, weak references whose targets are GCed can be automatically put into a queue, which can be used to build clean up logic to remove dead keys from a weak hash map. There is nothing similar in .NET, where a weak reference just silently loses its value.

The Internet is full of custom implementations of weak dictionaries in .NET.

.NET 4.5 finally defines a class ConditionalWeakTable<TKey, TValue>, which solves the problem in case when you need to match keys by instance identity.

Unfortunately in our case we needed to match keys using key's GetHashCode() and Equals(). So, ConditionalWeakTable<TKey, TValue> did not directly work, but then we found a way to make it work for us.

Here is a quote from the definition:

A ConditionalWeakTable<TKey, TValue> object is a dictionary that binds a managed object, which is represented by a key, to its attached property, which is represented by a value. The object's keys are the individual instances of the TKey class to which the property is attached, and its values are the property values that are assigned to the corresponding objects.

...in the ConditionalWeakTable<TKey, TValue> class, adding a key/value pair to the table does not ensure that the key will persist, even if it can be reached directly from a value stored in the table... Instead, ConditionalWeakTable<TKey, TValue> automatically removes the key/value entry as soon as no other references to a key exist outside the table.

This property of ConditionalWeakTable<TKey, TValue> has helped us to build a way to get a notification when the key is being finalized, which is the missing ingredient in .NET's weak references.

Assume you have an instance key of type Key. To get a notification you should define a class Finalizer that will call some handler when it's finalized, and you should bind key and a finalizer instance using weak table.

The code looks like this:

public class Finalizer<K>
  where K: class
{
  public static void Bind(K key, Action<K> handler)
  {
    var finalizer = table.GetValue(key, k => new Finalizer<K> { key = k });

    finalizer.Handler += handler;
  }

  public static void Unbind(K key, Action<K> handler)
  {
    Finalizer<K> finalizer;

    if (table.TryGetValue(key, out finalizer))
    {
      finalizer.Handler -= handler;
    }
  }

  ~Finalizer()
  {
    var handler = Handler;

    if (handler != null)
    {
      handler(key);
    }
  }

  private event Action<K> Handler;
  private K key;

  private static readonly ConditionalWeakTable<K, Finalizer<K>> table =
    new ConditionalWeakTable<K, Finalizer<K>>();
}

Key key = ...

Finalizer<Key>.Bind(key, k => { /* clean up. */ });

Using this approach we have created a class WeakTable<K, V> modeled after ConditionalWeakTable<TKey, TValue>.

So, this is our take on the problem: WeakTable.cs.
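For readers who want to experiment with the notification idea outside .NET, here is a small Python sketch. It relies on the standard weakref.finalize, which calls a function after an object has been garbage collected. The names Key, bind and demo are hypothetical. Unlike the C# Finalizer above, the handler here cannot receive the key itself, as the key no longer exists by the time the handler runs, so the sketch passes the key's name instead.

```python
import gc
import weakref


class Key:
    """A hypothetical key type; any weakly referenceable object would do."""

    def __init__(self, name):
        self.name = name


def bind(key, handler):
    """Arranges for handler(key.name) to be called once key is collected."""
    # The handler and its arguments must not reference the key itself,
    # otherwise the key would never become garbage.
    return weakref.finalize(key, handler, key.name)


def demo():
    """Returns names of keys that were reported as collected."""
    cleaned = []
    key = Key("ticket-1")

    bind(key, cleaned.append)

    # Drop the only strong reference and let the collector run.
    del key
    gc.collect()

    return cleaned
```

A weak table can use such a handler to remove the entry of a dead key, which is the clean up step that Java builds on top of reference queues.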

Windows: so far so worse

Arthur Nesterovsky — Mon, 04 Nov 2013 11:20:28 GMT

A day ago we installed Win 8.1 on our laptops. We thought it would be better than Win 8, but now we are afraid of the next release of Windows... So far, so worse.

ICS became unusable: in Windows 7 we shared an Internet connection from a USB 3G dongle between our two computers; in Windows 8 we managed to share only file system access, but not Internet. In Windows 8.1 neither Internet nor file system is accessible...

String comparison: SQL vs COBOL

Vladimir Nesterovsky — Mon, 14 Oct 2013 20:23:11 GMT

Till recently we were living in a simple world of string comparisons in SQL style, and now everything has changed.

From our university years we knew that strings in SQL are compared by first trimming trailing spaces, and then comparing in C style.

Well, the picture was a little more complex, as collations were involved (national, case sensitivity), and as different SQL vendors implemented it differently.

Next, we're dealing with programs converted from COBOL, which we originally thought follow SQL rules when strings are compared.

Here is where the problem has started.

Once we found that a Java program branched differently than the original COBOL, and the reason was that COBOL and Java compared two strings differently:

COBOL: "A\n" < "A";
Java: "A\n" > "A"

We have looked into COBOL Language Reference and found the rules:

Operands of equal size

Characters in corresponding positions of the two operands are compared, beginning with the leftmost character and continuing through the rightmost character.

If all pairs of characters through the last pair test as equal, the operands are considered as equal.

If a pair of unequal characters is encountered, the characters are tested to determine their relative positions in the collating sequence. The operand that contains the character higher in the sequence is considered the greater operand.

Operands of unequal size

If the operands are of unequal size, the comparison is made as though the shorter operand were extended to the right with enough spaces to make the operands equal in size.

You can see that strings must not be trimmed but padded with spaces to the length of the longer string, and only then are they compared. This subtle difference has a significant impact for characters below the space.
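The difference between the two rules is easy to reproduce. The Python sketch below uses hypothetical helper names and compares characters by code point, which is an assumption: a real COBOL program uses its own collating sequence. One function pads the shorter operand with spaces as the COBOL rule prescribes, the other trims trailing spaces first. Each returns -1, 0 or 1.

```python
def compare_padded(x, y):
    """COBOL-style: pad the shorter operand with spaces, then compare."""
    width = max(len(x), len(y))
    a, b = x.ljust(width), y.ljust(width)

    return (a > b) - (a < b)


def compare_trimmed(x, y):
    """Trim trailing spaces from both operands, then compare."""
    a, b = x.rstrip(" "), y.rstrip(" ")

    return (a > b) - (a < b)
```

For "A\n" and "A" the padded comparison sets the line feed against a space, and the line feed is lower, so "A\n" is less than "A". The trimmed comparison sees a longer string with an equal prefix, so "A\n" is greater. For operands that differ only in trailing spaces both rules agree.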

So, here we've found that COBOL and SQL comparisons are different.

But then we asked ourselves: how does SQL really behave?

We've tested comparisons in SQL Server and DB2, and have seen that our understanding of SQL comparison holds. It works as if trimming spaces, and then comparing.

But again we have looked into the SQL-92 definition, and this is what we see there:

8.2 <comparison predicate>
3) The comparison of two character strings is determined as follows:

a) If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD attribute, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.

So, what we see is that SQL-92 rules are very close to COBOL rules, but then we reach the question: how come that at least SQL Server and DB2 implement string comparison differently than SQL-92 dictates?

Update: we have found that both SQL Server and DB2 have their string collation defined in a way that <space> is less than any other character. So the following is always true: '[' + char(13) + ']' > '[ ]'.

On Garbage Collectors

Vladimir Nesterovsky — Sun, 14 Jul 2013 12:20:29 GMT

Recently we've seen an article Why mobile web apps are slow.

While the title mentions web apps, the criticism is directed purely at the javascript language. The answer presented there is twofold:

Raw javascript performance is ~5 times slower than performance of native code solving the same task.
This accounts for the most modern implementations that exploit JIT. The author does not expect this proportion to change significantly in javascript's favor in the near future.
Garbage Collection, which is an essential part of javascript, does not work well in the constrained environment of mobile devices.
Here the author quotes many references that show that:
- for GC to work on par with a non-GC application, it needs ~6 - 8 times the memory that the application needs;
- at the same time for hardware reasons, mobile devices cannot provide such an amount of memory;
- on the other hand with the rise of CPU performance, GC pressure rises even faster.

In the end the author, while mentioning some attempts to change the situation, gives no final verdict on whether anything can be done to remedy the problem.

Having roots in C++, we're GC unbelievers. But you know, who will ask your opinion on that, while there are all those modern languages that try to abstract from memory and implicitly or explicitly assume GC: java, C#, javascript, xslt, xquery, and so on.

There is always a solution to avoid GC completely, as C++ and others (e.g. Microsoft's C++/CX, or Apple's ARC) do. But, assuming you have entered the GC world, what can be done with it? How do you make it more predictable, less greedy, and probably faster?

Our arguments are like this.

How does native code manage object graphs?

Today's solution is reference counting along with weak references to break cycles in graph.

Can GC be based on this?

Yes.

In fact upcoming spec of javascript contains weak references. So, provided a developer accurately defines relations in an object graph, one may immediately achieve the same efficiency as native solution.

If one does not use weak references consistently then object cycles can be created, so there can appear a graph that is not referenced anywhere in a program. This graph can be collected with classical GC that scans object roots.

Classical GC part can be used as a debug mode leak detector, and will collect graph cycles at runtime.

Thus, we claim that a hybrid memory management: reference counting with weak references plus classical GC is possible; it will be equal to a native memory management when weak references are properly used, and it will be close to classical GC without use of weak references.
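CPython happens to manage memory in this hybrid way: objects are reference counted, and a separate cycle collector picks up what reference counting cannot free. That makes it a convenient place to watch the claim at work. In the sketch below (the names Node, build and freed_without_gc are hypothetical) a child refers to its parent either weakly or strongly, and the cycle collector is switched off to see whether reference counting alone releases the graph.

```python
import gc
import weakref


class Node:
    """A tree node with a back reference to its parent."""

    def __init__(self):
        self.parent = None
        self.children = []


def build(weak_parent):
    """Builds a root with one child; the back reference is weak or strong."""
    root = Node()
    child = Node()

    root.children.append(child)
    child.parent = weakref.ref(root) if weak_parent else root

    return root


def freed_without_gc(weak_parent):
    """Tells whether the graph is released by reference counting alone."""
    gc.disable()

    try:
        root = build(weak_parent)
        probe = weakref.ref(root)

        # Drop the last external reference to the graph.
        del root

        return probe() is None
    finally:
        gc.enable()
        # The fallback: the cycle collector frees a leftover cycle, if any.
        gc.collect()
```

With the weak back reference the graph is released at once, with no collector involved. With the strong one it survives until the cycle collector runs, which is exactly the fallback role described above.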

This solution gives a rule to mitigate GC drawbacks: just use weak references in appropriate points, and you can continue to live in GC world, where GC is only a fallback.