A database we support for a client contains multi-billion row tables. Many
users query the data from that database, and it's permanently populated
with a new data.
Every day we load several millions rows of a new data. Such loads can lock tables for a
considerable time, so our loading procedures collect new data into intermediate
tables and insert it into a final destination by chunks, and usually after work
hours.
SQL Server 2008 R2 introduced
READ_COMMITTED_SNAPSHOT database option. This feature trades locks for an
increased tempdb size (to store row versions) and possible performance
degradation during a transaction.
When we have switched the database to that option we did
not notice any considerable performance change. Encouraged, we've decided to
increase size of chunks of data we insert at once.
Earlier we have found that when we insert no more than 1000 rows
at once, users don't notice impact, but for a bigger chunk sizes users start to
complain on performance degradation. This has probably happened due to locks
escalations.
Now, with chunks of 10000 or even 100000 rows we have found that no queries
became slower. But load process became several times faster.
We were ready to pay for increased tempdb and transaction log size to increase
performance, but in our case we didn't approach limits assigned by the DBA.
Another gain is that we can easily load data at any time. This makes data we
store more up to date.
Yesterday, by accident, we've seen an article about some design principles
of V8 JavaScript Engine.
It made clearer what techniques are used in today's script implementations.
In particular V8 engine optimizes property access using "dynamically
created
hidden classes". These are structures to store object's layout, they
are derived when ש new property is created (deleted) on the object. When code
accesses a property, and if a cached object's dynamic hidden class is available
at the code point then access time is comparable to one of native fields.
In our opinion this tactics might lead to a proliferation of such dynamic hidden
classes, which requires a considerable housekeeping, which also slows property
write access, especially when it's written for the first time.
We would like to suggest a slightly different strategy, which exploits the
cache matches, and does not require a dynamic hidden classes.
Consider an implementation data type with following characteristics:
- object is implemented as a hash map of property id to property value: Map<ID, Value>;
- it stores data as an array of pairs and can be accessed directly: Pair<ID,
Value> values[];
- property index can be acquired with a method: int index(ID);
A pseudo code for the property access looks like this:
pair = object.values[cachedIndex];
if (pair.ID == propertyID)
{
value = pair.Value;
}
else
{
// Cache miss.
cachedIndex = object.index(propertyID);
value = objec.values[cachedIndex].Value;
}
This approach brings us back to dictionary like implementation but with
important optimization of array speed access when property index is cached, and
with no dynamic hidden classes.
@michaelhkay Saxon 9.4 is out.
But why author does not state that HE version is still xslt/xpath 2.0, as neither xslt maps, nor function items are supported.
Recently, we have found and reported the bug in the SQL Server 2008 (see SQL Server 2008 with(recompile), and also Microsoft Connect).
Persons, who's responsible for the bug evaluation has closed it, as if "By Design". This strange resolution, in our opinion, says about those persons only.
Well, we shall try once more (see Microsoft Connect). We have posted another trivial demonstartion of the bug, where we show that option(recompile) is not used, which leads to table scan (nothing worse can happen for a huge table).
A bit history: the first release of this solution was about 9.5 years ago...
Today we've run into a strange situation. One of our clients ask us about automatic conversion of data from mainframe (that were defined as COBOL copybooks) into XML or Java/.NET objects. On our suggestion to use eXperanto, which is well known to him, he stated that he wouldn't like to use a tool of a company that is no more exists...
The situation, in our opinion, become more strange when you consider the following:
- eXperanto (the design-time tool and run-time libraries for Java and .NET) were developed, well tested, and delivered by us to production already several years ago.
- the client bought this set (the tool and libraries).
- the set is in production yet already in another big company, and is used time to time by our company in different migration projects.
- the client talks with developers of this tool and run-time libraries, and he knows about this fact.
- the client uses widely open source solutions even without dedicated vendors or support warranties.
It has happened so, that we have never worked with jQuery, however were aware of
it.
In early 2000 we have developed a web application that contained rich javascript
APIs, including UI components. Later, we were actively practicing in ASP.NET, and
later in JSF.
At present, looking at jQuery more closely we regret that we have failed to
start using it earlier.
Separation of business logic and presentation is remarkable when one uses JSON
web services. In fact server part can be seen as a set of web services
representing a business logic and a set of resources: html, styles, scripts,
others. Nor ASP.NET or JSF approach such a consistent separation.
The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data
to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews
.
Unfortunatelly after reading about this
binding API, and
being in love with Xslt and XQuery we just want to cry. We don't know what would
be the best solution for the task, but what we see looks uncomfortable to us.
Incidentally, we have found one new implementation of yield return in java that is
in the development stage. Sources can be found at
https://github.com/peichhorn/lombok-pg/zipball/master. Just to be sure we
have copied those sources at other place
peichhorn-lombok-pg-0.10.0-39-g384fb7b.zip (you may search "yield"
in the archive).
It's broken according to source tracker, but the funny thing is that sources,
however different, still resemble our yield return implementation (Yield.jar,
Yield.3.7.jar
- Indigo, Yield.zip
- sources) very much: variable names, error messages, algorithmic structure.
Those programmers probably have forgotten good manners: to reference a base work,
at least.
Well, we generously forgive them this blunder.
P.S. our implementation, in contrast, works without bugs.
P.P.S. misunderstanding is resolved. See comments.
A couple of weeks ago, we have suggested to introduce a enumerator function into
the XPath (see
[F+O30] A enumerator function):
I would like the WG to consider an addition of a function that turns a sequence
into a enumeration of values.
Consider a function like this:
fn:enumerator($items as item()*) as function() as item()?;
alternatively, signature could be:
fn:enumerator($items as function() as item()*) as function() as item()?;
This function receives a sequence, and returns a function item, which upon N's
call shall return N's element of the original sequence. This way, a sequence of
items is turned into a function providing a enumeration of items of the
sequence.
As an example consider two functions:
a) t:rand($seed as xs:double) as xs:double* - a function producing a random
number sequence;
b) t:work($input as element()) as element() - a function that generates output
from it's input, and that needs random numbers in the course of the execution.
t:work() may contain a code like this:
let $rand := fn:enumerator(t:rand($seed)),
and later it can call $rand() to get a random numbers.
Enumerators will help to compose algorithms where one algorithm communicate
with other independant algorithms, thus making code simpler. The most obvious
class of enumerators are generators: ordered numbers, unique identifiers,
random numbers.
Technically, function returned from fn:enumerator() is nondetermenistic, but its "side effect" is
similar to a "side effect" of a function generate-id() from a newly created
node (see bug #13747, and bug #13494).
The idea is inspired by a generator function, which returns a new value upon each
call.
Such function can be seen as a stateful object. But our goal is to look at
it in a more functional way. So, we look at the algorithm as a function that
produces a sequence of output, which is pure functional; and an enumerator that
allows to iterate over algorithm's output.
This way, we see the function that implements an algorithm and the function that
uses it can be seen as two thread of functional programs that use messaging to
communicate to each other.
Honestly, we doubt that WG will accept it, but it's interesting to watch the
discussion.
More than month has passed since we have reported a problem to the saxon forum (see
Saxon optimizer bug and
Saxon 9.2 generate-id() bug).
The essence of the problem is that we have constructed argumentless function to
return a unique identifiers each time function is called. To achieve the effect
we have created a temporary node and returned its generate-id() value.
Such a function is nondetermenistic, as we cannot state that its result depends
on arguments only. This means that engine's optimizer is not free to reorder
calls to such a function. That's what happens in Saxon 9.2, and Saxon 9.3 where
engine elevates function call out of cycle thus producing invalid results.
Michael Kay, the author of the Saxon engine, argued that this is "a gray area of
the xslt spec":
If the spec were stricter about defining exactly when you can rely on identity-dependent
operations then I would be obliged to follow it, but I think it's probably deliberate
that it currently allows implementations some latitude, effectively signalling to
users that they should avoid depending on this aspect of the behaviour.
He adviced to raise a bug in the w3c bugzilla to resolve the issue. In the end
two related bugs have been raised:
- Bug 13494
- Node
uniqueness returned from XSLT function;
- Bug 13747
- [XPath 3.0] Determinism of expressions returning constructed nodes.
Yesterday, the WG has resolved the issue:
The Working Group agreed that default behavior should continue to require these
nodes to be constructed with unique IDs.
We believe that this is the kind of thing implementations can do with
annotations or declaration options, and it would be best to get implementation
experience with this before standardizing.
This means that the technique we used to generate unique identifiers is correct
and the behaviour is well defined.
The only problem is to wait when Saxon will fix its behaviour accordingly.
We're not big fans of
Entity Framework, as we don't directly expose the database structure to
the client program but rather through stored procedures and functions. So, EF for
us is a tool to expose those stored procedures as .NET wrappers. This limited use
of EF still greatly automates the data access code.
But what we have lately found is that the EF has a problem with char parameters. Namely,
if you import a procedure say MyProc that accepts char(1),
and then will call it through the generated wrapper, the you will see in sql profiler
that char(1) parameter is passed with many trailing spaces as if it
were char(8000). There isn't necessity to prove that this is highly
ineffective.
We can see that the problem happens in VS 2010 designer rather than in the EF runtime,
as SP's parameters are not attributed with length, see model xml (*.edmx):
<Function Name="MyProc" Schema="Data">
...
<Parameter Name="recipientType" Type="char" Mode="In"
/>
...
</Function>
while if we set:
<Parameter Name="recipientType" Type="char" MaxLength="1"
Mode="In" />
the runtime starts working as expected. So the workaround is to fix model file manually.
See also:
Stored Proc and Char parm
1. query.dll vs tquery.dll
We have installed
Windows Search 4 on a Windows 2003 server. The goal was to index huge compressed
xml files (see
Windows Search Notifications). But for some reason it did not want to index
content.
No "select System.ItemUrl from SystemIndex where contains('...')"
has ever returned a row.
We thought that the problem was in our protocol handler, and tried to localize it,
but finally have discovered that Windows Search is not able to find anything within
text files.
Registry comparision has shown that *.txt extension was indexed by the IFilter defined
in the query.dll, while on the other computers, where everything worked, the implementation
was in the tquery.dll.
Both libraries were present on the Windows 2003 server, so we have corrected the
registry and everything has started to work.
As far as we understand query.dll is part of legacy
Indexing Service, and tquery.dll is up to date implementation.
2. Search index size
We have to index a considerable amout of data. But before we can do it we have to
estimate the size of index.
In the past it seems we saw somewhere a statement that search index needs a storage
that's about 10% of original data for its purposes. Unfortunatelly we cannot
find this estimation at present, neither we cannot find any other estimation. This
complicates our planning.
To get empirical estimate we've indexed several thousands *.xml-gz files, which
are gz'ed big xmls. The total size of this files is about 4.5GB. Total uncompressed
size of xmls ~50GB. Xml contained about 10 millions pages of data.
According to 10% criteria we had to arrive to ~5GB search index.
But what we have discovered is that the index has grown to more than 50GB. That's
very disappointing. We cannot afford such expense, as we've commited test on
a tiny part of data, which increases over time.
So, the solution is to find out what's wrong, and how can it be cured, or to
fulltext index only most recent subset of data.
P.S. We have tried to mark folder with search index as compressed, but it did not
work.
P.P.S. We have found the reference to Windows Search 4 index size estimation. It is in
Windows Search Frequently Asked Questions, see answer on "What is average size of a user's index?" question.
We needed to track a stream position during creation of xml file. This is to
allow random access to a huge xml file (the task is related to
WindowsSearch).
This is a simplified form of the xml:
<data>
<item>...</item>
...
<item>...</item>
</data>
The goal was to have stream position of each item element. With this in mind,
we've decided to:
- open a stream, and then xml writer over it;
- write data into xml writer;
- call
Flush() method of the xml writer before measuring stream offset;
That's a code sample:
var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);
writer.WriteStartDocument();
writer.WriteStartElement("data");
for(var i = 0; i < 10; ++i)
{
writer.Flush();
Console.WriteLine("Flush offset: {0}, char: {1}",
stream.Position,
(char)stream.GetBuffer()[stream.Position - 1]);
writer.WriteStartElement("item");
writer.WriteValue("item " + i);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
That's the output:
Flush offset: 46, char: a
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >
Funny, isn't it?
After feeding the start tag <data>, and flushing xml writer we observe that only
"<data" has been written down to the stream. Well,
Flush() have never promissed anything particular about the content
of the stream, so we cannot claim any violation, however we expected to see
whole start tag.
Inspection of the implementation of xml writer reveals laziness during writting
data down the stream. In particular start tag is closed when one starts the
content. This is probably to implement empty tags: <data/>.
To do the trick we had to issue empty content, moreover, to call a particular
method with particular parameters of the xml writer. So the code after the fix
looks like this:
var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);
writer.WriteStartDocument();
writer.WriteStartElement("data");
char[] empty = { ' ' };
for(var i = 0; i < 10; ++i)
{
writer.WriteChars(empty, 0, 0);
writer.Flush();
Console.WriteLine("Flush offset: {0}, char: {1}",
stream.Position,
(char)stream.GetBuffer()[stream.Position - 1]);
writer.WriteStartElement("item");
writer.WriteValue("item " + i);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
And output is:
Flush offset: 47, char: >
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >
While this code works, we feel uneasy with it.
What's the better way to solve the task?
Update: further analysis shows that it's
only possible behaviour, as after the call to write srart element, you either
can write attributes, content or end of element, so writer may write either
space, '>' or '/>'. The only
question is why it takes WriteChars(empty, 0, 0) into account and WriteValue("")
it doesn't.
As you probably know we have implemented our custom Protocol Handler for the Windows
Search.
It's called .xml-gz, and has a goal to index compressed xml files and to have
search results with a subtree precision. So, for xml:
<data>
<item>...</item>
<item>...</item>
...
</data>
search finds results within item and returns xml's url and stream
offset of the item. Using ZLIB API we can compress data with stream bookmarks, so fast random
access to the data is possible.
The only problem we have is about notification of changes (create, delete, update)
of such files.
Spec describes several techniques (nothing has worked for us):
1. Call catalogManager.ReindexMatchingURLs()
- it just returns without any impact.
2.Call changeSink.OnItemsChanged()
- returns error.
3. Implement
.xml-gz IFilter and call IGatherNotifyInline (see "
have your .zip urls indexed when they are created or modified") -
that's a mistery, as:
4. Implement root url in form .xml-gz:/// and perform Windows Search:
SELECT
System.ItemUrl, System.DateModified
FROM
SystemIndex WHERE System.FileExtension='.xml-gz'
to find all .xml-gz sources. This is not reliable, as your protocol handler can
be (and is) called before file is indexed.
So, the only reliable way to index your data is to (re-)add indexing rule for
the protocol handler, which in most cases reindexes everything.
The only bearable solution we found is to define indexing rule in the form:
.xml-gz://file:d:/data/... and to use
IShellFolder(2)
interfaces to discover sub items and their modification times. This technique allows
minimal data scan when you're (re-)add indexing rule.
Being unexperienced with Windows Search we tried to build queries to find data in the huge storage. We needed to find a document that matches some name pattern and contains some text.
Our naive query was like this:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemName like '...%' and contains('...')
In most cases this query returns nothing and runs very long. It's interesting to note that it may start returning data if "top" clause is missing or uses a bigger number, but in this cases query is slower even more.
Next try was like this:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemName >= '...' and System.ItemName < '...' and contains('...')
This query is also slow, but at least it returns some results.
At some point we have started to question the utility of Windows Search if it's so slow, but then we have found that there is a property System.ItemNameDisplay, which in our case coincides with the value of property System.ItemName, so we have tried the query:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemNameDisplay like '...%' and contains('...')
This query worked fast, and produced good results. This hints that search engine has index on System.ItemNameDisplay in contrast to System.ItemName property.
We've looked at property definitions:
System.ItemNameDisplay
The display name in "most complete" form. It is the unique representation of the item name most appropriate for end users.
propertyDescription name = System.ItemNameDisplay shellPKey = PKEY_ItemNameDisplay formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC propID = 10 searchInfo inInvertedIndex = true isColumn = true isColumnSparse = false columnIndexType = OnDisk maxSize = 128
System.ItemName
The base name of the System.ItemNameDisplay property.
propertyDescription name = System.ItemName shellPKey = PKEY_ItemName formatID = 6B8DA074-3B5C-43BC-886F-0A2CDCE00B6F propID = 100 searchInfo inInvertedIndex = false isColumn = true isColumnSparse = false columnIndexType = OnDisk maxSize = 128
Indeed, one property is indexed, while the other is not.
As with other databases, query is powerful when engine uses indices rather than performs data scan. This is also correct for Windows Search.
The differences in results that variations of query produce also manifests that Windows Search nevertheless is very different from relational database.
We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.
You can think of some virtual folder, so a Protocol Handler allows to enumerate it's files, file properties, and contents.
The goal of our Protocol Handler is to represent some data structure as a set of xml files. We expected that if we found a data within a folder with these files, then a search within Protocol Handler's scope would bring the same (or almost the same) results.
Reality is different.
For some reason .xml IFilter (a component to extract text data to index) works differently with file system and with our storage. We cannot state that it does not work, but for some reason many words that Windows Search finds within a file are never found within Protocol Handler scope.
We have observed that if, for purpose of indexing, we represent content xml items as .txt files, then search works as expected. So, our workaround was to present only xml's text data for the indexing, and to use .txt IFilter (this in fact roughly what .xml IFilter does by itself).
Is there a conclusion?
Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.
Let's put it blatantly: Windows Search 4 has design and implementation problems.
You discover this immediatelly when you start implementing indexing of custom file format.
If you want to index simple file format then you need to imlement you IFilter interface. But if it has happened so that you want to index compound data then you should invent you own protocols.
If you will fugure out how to implement your protocol to index that compound data, then you will most probably stuck on the problem on how to notify indexer about the changes.
The problem is that Windows Search 4 has API to reindex urls, which simply does not work, or to notify indexer about changes, which throws an error (returns error HRESULT) for custom protocols. At least, we were not able to make it run.
Do you know that the best JSF/Facelets visual editor, in our opinion, is ... Microsoft Visual Studio 2008? Another rather good JSF editor is presented in IBM RAD 7.xx. The most popular, open source Java IDE Eclipse contains an ugly implementation of such useful thing.
Already for a couple of days we're trying to create a UserControl containing a
TabContainer. To achieve the goal we have
created a page with a ToolkitScriptManager and a user control itself.
Page:
<form runat="server">
<ajax:ToolkitScriptManager ID="ToolkitScriptManager1" runat="server"/>
<uc1:WebUserControl1 ID="WebUserControl11" runat="server" />
</form>
User control:
<%@ Control Language="C#" %>
<%@ Register
Assembly="AjaxControlToolkit"
Namespace="AjaxControlToolkit"
TagPrefix="ajax" %>
<ajax:TabContainer ID="Tab" runat="server" Width="100%">
<ajax:TabPanel runat="server" HeaderText="Tab1" ID="Tab1">
<ContentTemplate>Panel 1</ContentTemplate>
</ajax:TabPanel>
</ajax:TabContainer>
What could be more simple?
But no, there is a problem. At run-time
control works perfectly, but at the
designer it shows an error instead of a normal design view:
Error Rendering Control - TabContainer1
An unhandled exception has occurred.
Could not find any resources appropriate for the specified culture or the neutral culture. Make sure "AjaxControlToolkit.Properties.Resources.NET4.resources" was correctly embedded or linked into assembly "AjaxControlToolkit" at compile time, or that all the satellite assemblies required are loadable and fully signed.
That's a stupid error, which says nothing about the real problem reason. We had
to attach a debugger to a Visual Studio just to realize what the problem is.
So, the error occurs at the following code of AjaxControlToolkit.ScriptControlBase:
private void EnsureScriptManager()
{
if (this._scriptManager == null)
{
this._scriptManager = ScriptManager.GetCurrent(this.Page);
if (this._scriptManager == null)
{
throw new HttpException(Resources.E_NoScriptManager);
}
}
}
Originally, the problem is due to the fact that ScriptManager is not found, and code
wants to report an HttpException, but fun is that we recieve a different exception, which is releted to a missing resouce text for a message Resources.E_NoScriptManager.
It turns out that E_NoScriptManager text is found neither in
primary no in resource assemblies.
As for original problem, it's hard to say about reason of why ScriptManager is
not available at design time. We, however, observed that a ScriptManager
registers itself for a ScriptManager.GetCurrent() method at
run-time only:
protected internal override void OnInit(EventArgs e)
{
...
if (!base.DesignMode)
{
...
iPage.Items[typeof(ScriptManager)] = this;
...
}
}
So, it's not clear what they (toolkit's developers) expected to get at design
time.
These observations leave uneasiness regarding the quality of the library.
Earlier, we have described an approach to call Windows Search from SQL Server 2008. But it has turned out that our problem is more complicated...
All has started from the initial task:
- to allow free text search in a store of huge xml files;
- files should be compressed, so these are *.xml.gz;
- search results should be addressable to a fragment within xml.
Later we shall describe how we have solved this task, and now it's enough to say that we have implemented a Protocol Handler for Windows Search named '.xml-gz:'. This way original file stored say at 'file:///c:/store/data.xml-gz' is seen as a container by the Windows Search:
- .xml-gz:///file:c:/store/data.xml-gz/id1.xml
- .xml-gz:///file:c:/store/data.xml-gz/id2.xml
- ...
This way search xml should be like this:
select System.ItemUrl from SystemIndex where scope='.xml-gz:' and contains(...)
Everything has worked during test: we have succeeded to issue Windows Search selects from SQL Server and join results with other sql queries.
But later on when we considered a runtime environment we have seen that our design won't work. The reason is simple. Windows Search will work on a computer different from those where SQL Servers run. So, the search query should look like this:
select System.ItemUrl from Computer.SystemIndex where scope='.xml-gz:' and contains(...)
Here we have realized the limitation of current (Windows Search 4) implementation: remote search works for shared folders only, thus query may only look like:
select System.ItemUrl from Computer.SystemIndex where scope='file://Computer/share/' and contains(...)
Notice that search restricts the scope to a file protocol, this way remoter search will never return our results. The only way to search in our scope is to perform a local search.
We have considered following approaches to resolve the issue.
The simplest one would be to access Search protocol on remote computer using a connection string: "Provider=Search.CollatorDSO;Data Source=Computer" and use local queries. This does not work, as provider simply disregards Data Source parameter.
The other try was to use MS Remote OLEDB provider. We tried hard to configure it but it always returns obscure error, and more than that it's deprecated (Microsoft claims to remove it in future).
So, we decided to forward request manually:
- SQL Server calls a web service (through a CLR function);
- Web service queries Windows Search locally.
Here we considered WCF Data Services and a custom web service.
The advantage of WCF Data Services is that it's a technology that has ambitions of a standard but it's rather complex task to create implementation that will talk with Windows Search SQL dialect, so we have decided to build a primitive http handler to get query parameter. That's trivial and also has a virtue of simple implementation and high streamability.
So, that's our http handler (WindowsSearch.ashx):
<%@ WebHandler Language="C#" Class="WindowsSearch" %>
using System; using System.Web; using System.Xml; using System.Text; using System.Data.OleDb;
/// <summary> /// A Windows Search request handler. /// </summary> public class WindowsSearch: IHttpHandler { /// <summary> /// Handles the request. /// </summary> /// <param name="context">A request context.</param> public void ProcessRequest(HttpContext context) { var request = context.Request; var query = request.Params["query"]; var response = context.Response;
response.ContentType = "text/xml"; response.ContentEncoding = Encoding.UTF8;
var writer = XmlWriter.Create(response.Output);
writer.WriteStartDocument(); writer.WriteStartElement("resultset");
if (!string.IsNullOrEmpty(query)) { using(var connection = new OleDbConnection(provider)) using(var command = new OleDbCommand(query, connection)) { connection.Open();
using(var reader = command.ExecuteReader()) { string[] names = null;
while(reader.Read()) { if (names == null) { names = new string[reader.FieldCount];
for (int i = 0; i < names.Length; ++i) { names[i] = XmlConvert.EncodeLocalName(reader.GetName(i)); } }
writer.WriteStartElement("row");
for(int i = 0; i < names.Length; ++i) { writer.WriteElementString( names[i], Convert.ToString(reader[i])); }
writer.WriteEndElement(); } } } }
writer.WriteEndElement(); writer.WriteEndDocument();
writer.Flush(); }
/// <summary> /// Indicates that a handler is reusable. /// </summary> public bool IsReusable { get { return true; } }
/// <summary> /// A connection string. /// </summary> private const string provider = "Provider=Search.CollatorDSO;" + "Extended Properties='Application=Windows';" + "OLE DB Services=-4"; }
And a SQL CLR function looks like this:
using System; using System.Collections; using System.Collections.Generic; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Net; using System.IO; using System.Xml;
/// <summary> /// A user defined function. /// </summary> public class UserDefinedFunctions { /// <summary> /// A Windows Search returning result as xml strings. /// </summary> /// <param name="url">A search url.</param> /// <param name="userName">A user name for a web request.</param> /// <param name="password">A password for a web request.</param> /// <param name="query">A Windows Search SQL.</param> /// <returns>A result rows.</returns> [SqlFunction( IsDeterministic = false, Name = "WindowsSearch", FillRowMethodName = "FillWindowsSearch", TableDefinition = "value nvarchar(max)")] public static IEnumerable Search( string url, string userName, string password, string query) { return SearchEnumerator(url, userName, password, query); }
/// <summary> /// A filler of WindowsSearch function. /// </summary> /// <param name="value">A value returned from the enumerator.</param> /// <param name="row">An output value.</param> public static void FillWindowsSearch(object value, out string row) { row = (string)value; }
/// <summary> /// Gets a search row enumerator. /// </summary> /// <param name="url">A search url.</param> /// <param name="userName">A user name for a web request.</param> /// <param name="password">A password for a web request.</param> /// <param name="query">A Windows Search SQL.</param> /// <returns>A result rows.</returns> private static IEnumerable<string> SearchEnumerator( string url, string userName, string password, string query) { if (string.IsNullOrEmpty(url)) { throw new ArgumentException("url"); }
if (string.IsNullOrEmpty(query)) { throw new ArgumentException("query"); }
var requestUrl = url + "?query=" + Uri.EscapeDataString(query);
var request = WebRequest.Create(requestUrl);
request.Credentials = string.IsNullOrEmpty(userName) ? CredentialCache.DefaultCredentials : new NetworkCredential(userName, password);
using(var response = request.GetResponse()) using(var stream = response.GetResponseStream()) using(var reader = XmlReader.Create(stream)) { bool read = true;
while(!read || reader.Read()) { if ((reader.Depth == 1) && reader.IsStartElement()) { // Note that ReadInnerXml() advances the reader similar to Read(). yield return reader.ReadInnerXml();
read = false; } else { read = true; } } } } }
And, finally, when you call this service from SQL Server you write query like this:
with search as ( select cast(value as xml) value from dbo.WindowsSearch ( N'http://machine/WindowsSearchService/WindowsSearch.ashx', null, null, N' select "System.ItemUrl" from SystemIndex where scope=''.xml-gz:'' and contains(''...'')' ) ) select value.value('/System.ItemUrl[1]', 'nvarchar(max)') from search
Design is not trivial but it works somehow.
After dealing with all these problems some questions remain unanswered:
- Why SQL Server does not allow to query Windows Search directly?
- Why Windows Search OLEDB provider does not support "Data Source" parameter?
- Why Windows Search does not support custom protocols during remote search?
- Why SQL Server does not support web request/web services natively?
We were trying to query Windows Search from an SQL Server 2008.
Documentation states that Windows Search is exposed as OLE DB datasource. This meant that we could just query result like this:
SELECT * FROM OPENROWSET( 'Search.CollatorDSO.1', 'Application=Windows', 'SELECT "System.ItemName", "System.FileName" FROM SystemIndex');
But no, such select never works. Instead it returns obscure error messages:
OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" returned message "Command was not prepared.". Msg 7399, Level 16, State 1, Line 1 The OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" reported an error. Command was not prepared. Msg 7350, Level 16, State 2, Line 1 Cannot get the column information from OLE DB provider "Search.CollatorDSO.1" for linked server "(null)".
Microsoft is silent about reasons of such behaviour. People came to a conclusion that the problem is in the SQL Server, as one can query search results through OleDbConnection without problems.
This is very unfortunate, as it bans many use cases.
As a workaround we have defined a CLR function wrapping Windows Search call and returning rows as xml fragments. So now the query looks like this:
select value.value('System.ItemName[1]', 'nvarchar(max)') ItemName, value.value('System.FileName[1]', 'nvarchar(max)') FileName from dbo.WindowsSearch('SELECT "System.ItemName", "System.FileName" FROM SystemIndex')
Notice how we decompose xml fragment back to fields with the value() function.
The C# function looks like this:
using System; using System.Collections; using System.IO; using System.Xml; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using System.Data.OleDb;
using Microsoft.SqlServer.Server;
public class UserDefinedFunctions { [SqlFunction( FillRowMethodName = "FillSearch", TableDefinition="value xml")] public static IEnumerator WindowsSearch(SqlString query) { const string provider = "Provider=Search.CollatorDSO;" + "Extended Properties='Application=Windows';" + "OLE DB Services=-4";
var settings = new XmlWriterSettings { Indent = false, CloseOutput = false, ConformanceLevel = ConformanceLevel.Fragment, OmitXmlDeclaration = true };
string[] names = null;
using(var connection = new OleDbConnection(provider)) using(var command = new OleDbCommand(query.Value, connection)) { connection.Open();
using(var reader = command.ExecuteReader()) { while(reader.Read()) { if (names == null) { names = new string[reader.FieldCount];
for (int i = 0; i < names.Length; ++i) { names[i] = XmlConvert.EncodeLocalName(reader.GetName(i)); } }
var stream = new MemoryStream(); var writer = XmlWriter.Create(stream, settings);
for(int i = 0; i < names.Length; ++i) { writer.WriteElementString(names[i], Convert.ToString(reader[i])); }
writer.Close();
yield return new SqlXml(stream); } } } }
public static void FillSearch(object value, out SqlXml row) { row = (SqlXml)value; } }
Notes:
- Notice the use of "
OLE DB Services=-4" in provider string to avoid transaction enlistment (required in SQL Server 2008).
- Permission level of the project that defines this extension function should be set to unsafe (see Project Properties/Database in Visual Studio) otherwise it does not allow the use OLE DB.
- SQL Server should be configured to allow CLR functions, see Server/Facets/Surface Area Configuration/ClrIntegrationEnabled in Microsoft SQL Server Management Studio
- Assembly should either be signed or a database should be marked as trustworthy, see Database/Facets/Trustworthy in Microsoft SQL Server Management Studio.
Last few days we were testing Java web-applications that expose web-services. During these tests we've found few interesting features.
The first feature allows to retrieve info about all endpoints supported by the web-application on GET request. The feature works at least for Metro that implements JAX-WS API v2.x. In order to get such info, a client sends any endpoint's URL to the server. The result is an HTML page with a table. Each row of such table contains an endpoint's data for each supported web-service method. This feature may be used as a web-services discovery mechanism.
The second feature is bad rather than good. JAX-WS API supposes that a developer annotates classes and methods that he/she wants to expose as web-services. Then, an implementation generates additional layer-bridge between developer's code and API that does all routine work behind the scene. May be that was a good idea, but Metro's implementation is imperfect. Metro dynamically generates such classes at run-time when a web-application starts. Moreover, Metro does such generation for all classes at once. So, in our case, when the generated web-based application contains dozens or even hundreds of web-services, the application's startup takes a lot of time.
Probably, Metro developers didn't want to deal with implementation of lazy algorithms, when a web-service is generated and cached on demand. We hope this issue will be solved in next releases.
A while ago we have created
a simple cache for Java application. It was modelled like a Map<K, V>: it cached
values for keys.
Use cases were:
Cache<String, Object> cache = new Cache<String, Object>();
...
instance = cache.get("key");
cache.put("key", instance);
But now we thought of different implementation like a WeakReference<V>
and with map access as additional utility methods.
Consider an examples:
1. Free standing CachedReference<V> instance.
CachedReference<Data> ref = new CachedReference<Data>(1000, true);
...
ref.set(data);
...
data = ref.get();
2. Map of CachedReference<V> instances.
ConcurrentHashMap<String, CachedReference<Data>> cache =
new
ConcurrentHashMap<String, CachedReference<Data>>();
CachedReference.put(cache, "key", data, 1000, true);
... data = CachedReference.get(cache, "key");
The first case is faster than original Cache<K, V> as it does not use any hash
map at all. The later case provides the same performance as Cache<K, V> but
gives a better control over the storage. Incidentally, CachedReference<V> is more compact than Cache<K, V>.
The new implementation is
CachedReference.java, the old one
Cache.java.
A method pattern we have suggested to use along with @Yield annotation brought
funny questions like: "why should I mark my method with @Yield annotation at
all?"
Well, in many cases you may live with ArrayList populated with data, and then to
perform iteration. But in some cases this approach is not practical either due
to amount of data or due to the time required to get first item.
In later
case you usually want to build an iterator that calculates items on demand. The @Yield annotation is designed as a marker of such methods. They are refactored
into state machines at compilation time, where each addition to a result list is
transformed into a new item yielded by the iterator.
So, if you have decided to use @Yield annotation then at some point you will ask yourself what
happens with resources acquired during iteration. Will resources be released if
iteration is interrupted in the middle due to exception or a break statement?
To address the problem yield iterator implements Closeable interface.
This way when you call close() before iteration reached the end, the state machine
works as if break statement of the method body is injected after the yield
point. Thus all finally blocks of the original method are executed and resources
are released.
Consider an example of data iterator:
@Yield
public Iterable<Data> getData(final Connection connection)
throws Exception
{
ArrayList<Data> result = new ArrayList<Data>();
PreparedStatement statement =
connection.prepareStatement("select key, value from table");
try
{
ResultSet resultSet = statement.executeQuery();
try
{
while(resultSet.next())
{
Data data = new Data();
data.key = resultSet.getInt(1);
data.value = resultSet.getString(2);
result.add(data); // yield point
}
}
finally
{
resultSet.close();
}
}
finally
{
statement.close();
}
return result;
}
private static void close(Object value)
throws IOException
{
if (value instanceof Closeable)
{
Closeable closeable = (Closeable)value;
closeable.close();
}
}
public void daoAction(Connection connection)
throws Exception
{
Iterable<Data> items = getData(connection);
try
{
for(Data data: items)
{
// do something that potentially throws exception.
}
}
finally
{
close(items);
}
}
getData() iterates over sql data. During the lifecycle it creates and releases
PreparedStatement and ResultSet.
daoAction() iterates over results provided by getData() and performs some
actions that potentially throw an exception. The goal of close() is to release
opened sql resources in case of such an exception.
Here you can inspect how state machine is implemented for such a method:
@Yield()
public static Iterable<Data> getData(final Connection connection)
throws Exception
{
assert (java.util.ArrayList<Data>)(ArrayList<Data>)null == null;
class $state implements java.lang.Iterable<Data>, java.util.Iterator<Data>, java.io.Closeable
{
public java.util.Iterator<Data> iterator() {
if ($state$id == 0) {
$state$id = 1;
return this;
} else return new $state();
}
public boolean hasNext() {
if (!$state$nextDefined) {
$state$hasNext = $state$next();
$state$nextDefined = true;
}
return $state$hasNext;
}
public Data next() {
if (!hasNext()) throw new java.util.NoSuchElementException();
$state$nextDefined = false;
return $state$next;
}
public void remove() {
throw new java.lang.UnsupportedOperationException();
}
public void close() {
do switch ($state$id) {
case 3:
$state$id2 = 8;
$state$id = 5;
continue;
default:
$state$id = 8;
continue;
} while ($state$next());
}
private boolean $state$next() {
java.lang.Throwable $state$exception;
while (true) {
try {
switch ($state$id) {
case 0:
$state$id = 1;
case 1:
statement = connection.prepareStatement("select key, value from table");
$state$exception1 = null;
$state$id1 = 8;
$state$id = 2;
case 2:
resultSet = statement.executeQuery();
$state$exception2 = null;
$state$id2 = 6;
$state$id = 3;
case 3:
if (!resultSet.next()) {
$state$id = 4;
continue;
}
data = new Data();
data.key = resultSet.getInt(1);
data.value = resultSet.getString(2);
$state$next = data;
$state$id = 3;
return true;
case 4:
$state$id = 5;
case 5:
{
resultSet.close();
}
if ($state$exception2 != null) {
$state$exception = $state$exception2;
break;
}
if ($state$id2 > 7) {
$state$id1 = $state$id2;
$state$id = 7;
} else $state$id = $state$id2;
continue;
case 6:
$state$id = 7;
case 7:
{
statement.close();
}
if ($state$exception1 != null) {
$state$exception = $state$exception1;
break;
}
$state$id = $state$id1;
continue;
case 8:
default:
return false;
}
} catch (java.lang.Throwable e) {
$state$exception = e;
}
switch ($state$id) {
case 3:
case 4:
$state$exception2 = $state$exception;
$state$id = 5;
continue;
case 2:
case 5:
case 6:
$state$exception1 = $state$exception;
$state$id = 7;
continue;
default:
$state$id = 8;
java.util.ConcurrentModificationException ce = new java.util.ConcurrentModificationException();
ce.initCause($state$exception);
throw ce;
}
}
}
private PreparedStatement statement;
private ResultSet resultSet;
private Data data;
private int $state$id;
private boolean $state$hasNext;
private boolean $state$nextDefined;
private Data $state$next;
private java.lang.Throwable $state$exception1;
private int $state$id1;
private java.lang.Throwable $state$exception2;
private int $state$id2;
}
return new $state();
}
Now, you can estimate for what it worth to write an algorithm as a sound state machine
comparing to the conventional implementation.
Yield annotation processor can be downloaded from
Yield.zip
or Yield.jar
See also
Yield return feature in java.
We're happy to announce that we have implemented @Yield annotation
both in javac and in eclipse compilers.
This way you get built-in IDE support for the feature!
To download yield annotation processor please use the following link:
Yield.zip
It contains both yield annotation processor, and a test project.
If you do not want to compile the sources, you can download
Yield.jar
We would like to reiterate on how @Yield annotation works:
- A developer defines a method that returns either
Iterator<T> or
Iterable<T> instance and marks it with @Yield
annotation.
- A developer implements iteration logic following the pattern:
- declare a variable to accumulate results:
ArrayList<T> items = new ArrayList<T>();
- use the following statement to add item to result:
items.add(...);
- use
return items;
or
return items.iterator();
to return result;
- mark method's params, if any, as final.
- A devoloper ensures that yield annotation processor is available during
compilation (see details below).
YieldProcessor rewrites method into a state machine at
compilation time.
The following is an example of such a method:
@Yield
public static Iterable<Integer> generate(final int from, final int to)
{
ArrayList<Integer> items = new ArrayList<Integer>();
for(int i = from; i < to; ++i)
{
items.add(i);
}
return items;
}
The use is like this:
for(int value: generate(7, 20))
{
System.out.println("generator: " + value);
}
Notice that method's implementation still will be correct in absence of
YieldProcessor.
Other important feature is that the state machine returned after the yield
processor is closeable.
This means that if you're breaking the iteration before the end is reached you
can release resources acquired during the iteration.
Consider the example where break exits iteration:
@Yield
public static Iterable<String> resourceIteration()
{
ArrayList<String> items = new ArrayList<String>();
acquire();
try
{
for(int i = 0; i < 100; ++i)
{
items.add(String.valueOf(i));
}
}
finally
{
release();
}
return items;
}
and the use
int i = 0;
Iterable<String> iterator = resourceIteration();
try
{
for(String item: iterator)
{
System.out.println("item " + i + ":" + item);
if (i++ > 30)
{
break;
}
}
}
finally
{
close(iterator);
}
...
private static <T> void close(T value)
throws IOException
{
if (value instanceof Closeable)
{
Closeable closeable = (Closeable)value;
closeable.close();
}
}
Close will execute all required finally blocks. This way resources will be
released.
To configure yield processor a developer needs to refer Yield.jar in build path,
as it contains @Yield annotation. For javac it's enough, as
compiler will find annotation processor automatically.
Eclipse users need to open project properties and:
- go to the "Java Compiler"/"Annotation Processing"
- mark "Enable project specific settings"
- select "Java Compiler"/"Annotation Processing"/"Factory Path"
- mark "Enable project specific settings"
- add Yield.jar to the list of "plug-ins and JARs that contain annotation
processors".
At the end we want to point that @Yield annotation is a syntactic
suggar, but it's important the way the foreach statement is important, as it
helps to write concise and an error free code.
See also
Yield feature in java implemented!
Yield feature in java
We could not stand the temptation to implement the @Yield annotation that
we described
earlier.
Idea is rather clear but people are saying that it's not an easy task to update
the sources.
They were right!
Implementation has its price, as we were forced to access JDK's classes of javac
compiler. As result, at present, we don't support other compilers such as
EclipseCompiler.
We shall look later what can be done in this area.
At present, annotation processor works perfectly when you run javac either from
the command line, from ant, or from other build tool.
Here is an example of how method is refactored:
@Yield
public static Iterable<Long> fibonachi()
{
ArrayList<Long> items = new ArrayList<Long>();
long Ti = 0;
long Ti1 = 1;
while(true)
{
items.add(Ti);
long value = Ti + Ti1;
Ti = Ti1;
Ti1 = value;
}
}
And that's how we transform it:
@Yield()
public static Iterable<Long> fibonachi() {
assert (java.util.ArrayList<Long>)(ArrayList<Long>)null == null : null;
class $state$ implements java.lang.Iterable<Long>, java.util.Iterator<Long>, java.io.Closeable {
public java.util.Iterator<Long> iterator() {
if ($state$id == 0) {
$state$id = 1;
return this;
} else return new $state$();
}
public boolean hasNext() {
if (!$state$nextDefined) {
$state$hasNext = $state$next();
$state$nextDefined = true;
}
return $state$hasNext;
}
public Long next() {
if (!hasNext()) throw new java.util.NoSuchElementException();
$state$nextDefined = false;
return $state$next;
}
public void remove() {
throw new java.lang.UnsupportedOperationException();
}
public void close() {
$state$id = 5;
}
private boolean $state$next() {
while (true) switch ($state$id) {
case 0:
$state$id = 1;
case 1:
Ti = 0;
Ti1 = 1;
case 2:
if (!true) {
$state$id = 4;
break;
}
$state$next = Ti;
$state$id = 3;
return true;
case 3:
value = Ti + Ti1;
Ti = Ti1;
Ti1 = value;
$state$id = 2;
break;
case 4:
case 5:
default:
$state$id = 5;
return false;
}
}
private long Ti;
private long Ti1;
private long value;
private int $state$id;
private boolean $state$hasNext;
private boolean $state$nextDefined;
private Long $state$next;
}
return new $state$();
}
Formatting is automatic, sorry, but anyway it's for diagnostics only. You
will never see this code.
It's iteresting to say that this implementation is very precisely mimics
xslt state machine implementation we have done back in 2008.
You can
download YieldProcessor here. We hope that someone will find our solution
very interesting.
Several times we have already wished to see
yield feature in java and all the time came to the same implementation:
infomancers-collections.
And every time with dissatisfaction turned away, and continued with regular
iterators.
Why? Well, in spite of the fact it's the best implementation of the feature we have
seen, it's still too heavy, as it's playing with java byte code at run-time.
We never grasped the idea why it's done this way, while there is
post-compile
time annotation processing in java.
If we would implemented the yeild feature in java we would created a @Yield
annotation and would demanded to implement some well defined code pattern like
this:
@Yield
Iteratable<String> iterator()
{
// This is part of pattern.
ArrayList<String> list = new ArrayList<String>();
for(int i = 0; i < 10; ++i)
{
// list.add() plays the role of yield return.
list.add(String.valueOf(i));
}
// This is part of pattern.
return list;
}
or
@Yield
Iterator<String> iterator()
{
// This is part of pattern.
ArrayList<String> list = new ArrayList<String>();
for(int i = 0; i < 10; ++i)
{
// list.add() plays the role of yield return.
list.add(String.valueOf(i));
}
// This is part of pattern.
return list.iterator();
}
Note that the code will work correctly even, if by mischance, post-compile-time
processing will not take place.
At post comile time we would do all required refactoring to turn these
implementations into a state machines thus runtime would not contain any third
party components.
It's iteresting to recall that we have also implemented similar refactoring in
pure xslt.
See What you can do with jxom.
Update: implementation can be found at Yield.zip
We have a class Beans used to serialize a list of generic objects into an xml.
This is done like this:
public class Call
{
public Beans input;
public Beans output;
...
}
@XmlJavaTypeAdapter(value = BeanAdapter.class)
public class Beans
{
public List<Object> bean;
}
Thanks to @XmlJavaTypeAdapter, we're able to write xml in
whatever form we want.
When we're serializing a Call instance:
Call call = ...
Beans beans = ...;
call.setInput(beans);
JAXBContext context = ...;
Marshaller marshaler = context.createMarshaller();
ObjectFactory factory = ...;
marshaler.marshal(factory.createCall(call),
result);
things work as expected, meaning that BeanAdapter is used during xml
serialization. But if it's happened that you want to serialize a Beans instance
itself, you start getting problems with the serialization of unknown objects. That's because JAXB does not use BeanAdapter.
We have found a similar case "How to assign an adapter
to the root element?", unfortunately with no satisfactory explanation.
That is strange.
michaelhkay: Saxon 9.3 has been out for 8 days: only two bugs so far, one found by me. I think that's a record.
Not necessary. We, for example, who use Saxon HE, have found nothing new in Saxon 9.3, while expected to see xslt 3.0. Disappointed. No actual reason to migrate.
P.S. We were among the first who were finding early bugs in previous releases.
Reading individual papers of C++ WG, you can find the following one:
|
N3174
|
10-0164
|
To move or not to move
|
Bjarne Stroustrup
|
2010-10-17
|
2010-10
|
|
Core
|
There, Bjarne Stroustrup thinks about issues with implicitly generated copy and
move operations in C++.
It's always a pleasure to see how one can deal with a problem burdened with
antagonisms. To conduct his position Bjarne skilfully uses not only rational but
also emotional argumentation:
...We may deem this “bad code that deserves to be broken” or “unrealistic”, but
this example demonstrates that the problem with a generated move has an exact counterpart
for copy (which we have lived with for 27 years)...
...In 1984, I missed the chance to protect us against copy and we have lived with
the problems ever since. I should have instituted some rule along the lines “if
a class has a destructor, no copy operations are generated” or “if a class has a
pointer member, no copy operations are generated.”...
It's impossible to recall this numbers without shivering. :-)
We're following w3's "Bug
9069 - Function to invoke an XSLT transformation".
There, people argue about xpath API to invoke xslt transformations. Function should
look roughly like this:
transform
(
$node-tree as node()?,
$stylesheet as item(),
$parameters as XXX
) as node()
The discussion is spinning around the last argument: $parameters as
XXX. Should it be an xml element describing parameters, a function returning values for parameter names, or some new type modelling immutable
map?
What is most interesting in this discussion is the leak about plans to introduce
a map type:
Comment 7 Michael Kay, 2010-09-14 22:46:58 UTC
We're currently talking about adding an immutable map to XSLT as a new data
type (the put operation would return a new map). There appear to be a number of
possible efficient implementations. It would be ideally suited for this purpose,
because unlike the mechanism used for serialization parameters, the values can be
any data type (including nodes), not only strings.
There is a hope that map will finally appear in xslt!
See also:
Bug 5630
- [DM] Tuples and maps,
Tuples and maps - Status: CLOSED, WONTFIX,
Map, based on immutable trees,
Maps in exslt2?
In the
previous post we have announced an
API to parse a COBOL source into the cobolxom.
We exploited the
incremental parser to build a grammar xml tree and then were planning to
create an xslt transformation to generate
cobolxom.
Now, we would like to declare that such xslt is ready.
At present all standard COBOL constructs are supported, but more tests
are required. Preprocessor support is still in the todo list.
You may peek into an examples of
COBOL:
Cobol grammar:
And
cobolxom:
While we were building a grammar to cobolxom stylesheet we asked ourselves
whether the COBOL parsing could be done entirely in xslt. The answer is yes, so
who knows it might be that we shall turn this task into pure xslt one. :-)
Accidentally we have found that implementation of String and StringBuilder
have been considerably revised, while public interface has remained the
same.
public sealed class String
{
private int m_arrayLength;
private int m_stringLength;
private char
m_firstChar;
}
This layout is dated to .NET 1.0.
VM, in fact, allocates more memory than that defined in C# class, as
&m_firstChar refers to an inline char buffer.
This way string's buffer length and string's length were two different
values, thus StringBuilder used this fact and stored its content in a private string
which it modified in place.
In .NET 4, string is different:
public sealed class String
{
private int m_stringLength;
private char
m_firstChar;
}
Memory footprint of such structure is smaller, but string's length should
always be the same as its buffer. In fact layout of string is now the same as
layout of char[].
This modification leads to implementation redesign of the StringBuilder.
Earlier, StringBuilder looked like the following:
public sealed class StringBuilder
{
internal IntPtr m_currentThread;
internal int m_MaxCapacity;
internal volatile
string m_StringValue;
}
Notice that m_StringValue is used as a storage, and
m_currentThread is used to preserve thread affinity of the internal
string value.
Now, guys at Microsoft have decided to implement StringBuilder very differently:
public sealed class StringBuilder
{
internal int m_MaxCapacity;
internal int m_ChunkLength;
internal int m_ChunkOffset;
internal char[] m_ChunkChars;
internal StringBuilder m_ChunkPrevious;
}
Inspection of this layout immediately reveals implementation technique. It's a
list of chunks. Instance itself references the last chunk (most recently
appended), and the previous chunks.
Characteristics of this design are:
- while
Length is small, performance almost the same as it was earlier;
- there are no more thread affinity checks;
Append(), and ToString() works as fast a in the old version.
Insert() in the middle works faster, as only a chuck should be splitted and
probably reallocated (copied), instead of the whole string;
- Random access is fast at the end O(1) and slows when you approaching the start
O(chunk-count).
Personally, we would select a slightly different design:
public sealed class StringBuilder
{
private struct Chunk
{
public int length; // Chunk length.
public int offset; // Chunk offset.
public char[] buffer;
}
private int m_MaxCapacity;
// Alternatively, one can use
// private List<Chunk> chunks;
private int chunkCount; // Number of used chunks.
private Chunk[] chunks; // Array of chunks except last.
private Chunk last; // Last chunk.
private bool nonHomogenous; // false if all chunks are of the same size.
}
This design has better memory footprint, and random access time is O(1) when there were no
inserts in the middle (nonHomogenous=false), and
O(log(chunkCount)) after such inserts. All other characteristics are the
same.
We have run into another xslt bug, which depends on several independent
circumstances and often behaves differently being observed. That's clearly a
Heisenbug.
Xslt designers failed to realize that a syntactic suggar they introduce into
xpath can turn into obscure bugs. Well, it's easy to be wise afterwards...
To the point.
Consider you have a sequence consisting of text nodes and
elements, and now you want to "normalize" this sequence wrapping
adjacent text nodes into
separate elements. The following stylesheet is supposed to do the work:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:variable
name="nodes" as="node()*">
<xsl:text>Hello, </xsl:text>
<string value="World"/>
<xsl:text>! </xsl:text>
<xsl:text>Well, </xsl:text>
<string value="hello"/>
<xsl:text>, if not joking!</xsl:text>
</xsl:variable>
<result>
<xsl:sequence
select="t:normalize($nodes)"/>
</result>
</xsl:template>
<xsl:function
name="t:normalize" as="node()*">
<xsl:param name="nodes" as="node()*"/>
<xsl:for-each-group select="$nodes" group-starting-with="*">
<xsl:variable
name="string" as="element()?" select="self::string"/>
<xsl:variable name="texts"
as="node()*"
select="current-group() except $string"/>
<xsl:sequence
select="$string"/>
<xsl:if test="exists($texts)">
<string
value="{string-join($texts, '')}"/>
</xsl:if>
</xsl:for-each-group>
</xsl:function>
</xsl:stylesheet>
We're expecting the following output:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="! Well, "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
But often we're getting other results, like:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="Well, ! "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
Such output may seriously confuse, unless you will recall the rule for the
xpath except operator:
The except operator takes two node sequences as operands and returns a sequence containing all the nodes that occur in the first operand but not in the second operand.
... these operators eliminate duplicate nodes from their result sequences based
on node identity. The resulting sequence is returned in document order..
...
The relative order of nodes in distinct trees is stable but implementation-dependent
These words mean that result sequence may be very different from original
sequence.
In contrast, if we change $text definition to:
<xsl:variable name="texts"
as="node()*"
select="current-group()[not(. is $string)]"/>
then the result becomes stable, but less clear.
See also
Xslt Heisenbug
It does not matter that DataBindExtender looks not usual in the ASP.NET. It turns to be so handy that built-in data binding is not considered to be an option.
After a short try, you uderstand that people tried very hard and have invented many controls and methods like ObjectDataSource, FormView, Eval(), and Bind() with outcome, which is very specific and limited.
In contrast DataBindExtender performs:
- Two or one way data binding of any business data property to any control property;
- Converts value before it's passed to the control, or into the business data;
- Validates the value.
See an example:
<asp:TextBox id=Field8 EnableViewState="false" runat="server"></asp:TextBox> <bphx:DataBindExtender runat='server' EnableViewState='false' TargetControlID='Field8' ControlProperty='Text' DataSource='<%# Import.ClearingMemberFirm %>' DataMember='Id' Converter='<%# Converters.AsString("XXXXX", false) %>' Validator='<%# (extender, value) => Functions.CheckID(value as string) %>'/>
Here, we beside a regualar two way data binding of a property Import.ClearingMemberFirm.Id to a property Field8.Text, format (parse) Converters.AsString("XXXXX", false), and finally validate an input value with a lambda function (extender, value) => Functions.CheckID(value as string).
DataBindExtender works also well in template controls like asp:Repeater, asp:GridView, and so on. Having your business data available, you may reduce a size of the ViewState with EnableViewState='false'. This way DataBindExtender approaches page development to a pattern called MVC.
Recently, we have found that it's also useful to have a way to run a javascript during the page load (e.g. you want to attach some client side event, or register a component). DataBindExtender provides this with OnClientInit property, which is a javascript to run on a client, where this refers to a DOM element:
... OnClientInit='$addHandler(this, "change", function() { handleEvent(event, "Field8"); } );'/>
allows us to attach onchange javascript event to the asp:TextBox.
So, meantime we're very satisfied with what we can achieve with DataBindExtender. It's more than JSF allows, and much more stronger and neater to what ASP.NET has provided.
The sources can be found at DataBindExtender.cs
Lately, we have found that we've accustomed to declare C#'s local variables using var:
var exitStateName = exitState == null ? "" : exitState.Name;
var rules = Environment.NavigationRules;
var rule = rules[caller.Name];
var flow = rule.NavigationCases[procedure.OriginExitState];
This makes code cleaner, and in presense of good IDE still allows to figure out
types very easely.
We, howerer, found that var tends to have exceptions in its
uses. E.g. for some reason most of boolean locals in our code tend to remain explicit
(matter of taste?):
bool succeed = false;
try
{
...
succeed = true;
}
finally
{
if (!succeed)
{
...
}
}
Also, type often survives in for, but not in foreach:
for(int i = 0; i < sourceDataMapping.Length;
++i)
{
...
}
foreach(var property in properties)
{
...
}
In addition var has some limitations, as one cannot easily
initialize such local with null. From the following we prefer the first approach:
IWindowContext context = null;
var context = (IWindowContext)null;
var context = null as IWindowContext;
var context = default(IWindowContext);
We might need to figure out a consistent code style as for var. It
might be like that:
- Numeric, booleans and string locals should use explicit type;
- Try to avoid locals initialized with null, or without initializer, or use type
if such variable cannot be avoided;
- Use var in all other cases.
Another code style could be like that:
- For the consistency, completely avoid the use of keyword
var.
Recently we were raising a question about serialization of ASPX output in xslt.
The question went like this:
What's the recommended way of ASPX page generation? E.g.:
------------------------ <%@ Page AutoEventWireup="true" CodeBehind="CurMainMenuP.aspx.cs" EnableSessionState="True" Inherits="Currency.CurMainMenuP" Language="C#" MaintainScrollPositionOnPostback="True" MasterPageFile="Screen.Master" %>
<asp:Content ID="Content1" runat="server" ContentPlaceHolderID="Title">CUR_MAIN_MENU_P</asp:Content>
<asp:Content ID="Content2" runat="server" ContentPlaceHolderID="Content"> <span id="id1222146581" runat="server" class="inputField system UpperCase" enableviewstate="false"> <%# Dialog.Global.TranCode %> </span> ... ------------------------
Notice aspx page directives, data binding expessions, and prefixed tag names without namespace declarations.
There was a whole range of expected answers. We, however, looked whether somebody have already dealed with the task and has a ready solution at hands.
In general it seems that xslt community is very angry about ASPX: both format and technology. Well, put this aside.
The task of producing ASPX, which is almost xml, is not solvable when you're staying with pure xml serializer. Xslt's xsl:character-map does not work at all. In fact it looks as a childish attempt to address the problem, as it does not support character escapes but only grabs characters and substitutes them with strings.
We have decided to create ASPX serializer API producing required output text. This way you use <xsl:output method="text"/> to generate ASPX pages.
With this goal in mind we have defined a little xml schema to describe ASPX irregularities in xml form. These are:
<xs:element name="declared-prefix"> - to describe known prefixes, which should not be declared;
<xs:element name="directive"> - to describe directives like <%@ Page %>;
<xs:element name="content"> - a transparent content wrapper;
<xs:element name="entity"> - to issue xml entity;
<xs:element name="expression"> - to describe aspx expression like <%# Eval("A") %>;
<xs:element name="attribute"> - to describe an attribute of the parent element.
This approach greately simplified for us an ASPX generation process.
The API includes:
In previous posts we were crying about problems with JSF to ASP.NET migration. Let's point to another one.
Consider that you have an input field, whose value should be validated:
<input type="text" runat="server" ID="id1222146409" maxlength="4"/> <bphx:DataBindExtender runat="server" TargetControlID="id1222146409" ControlProperty="Value" DataSource="<%# Import.AaControlAttributes %>" DataMember="UserEnteredTrancode"/>
Here we have an input control, whose value is bound to Import.AaControlAttributes.UserEnteredTrancode property. But what is missed is a value validation. Somewhere we have a function that could answer the question whether the value is valid. It should be called like this: Functions.IsTransactionCodeValid(value).
Staying within standard components we can use a custom validator on the page:
<asp:CustomValidator runat="server" ControlToValidate="id1222146409" OnServerValidate="ValidateTransaction" ErrorMessage="Invalid transaction code."/>
and add the following code-behind:
protected void ValidateTransaction(object source, ServerValidateEventArgs args) { args.IsValid = Functions.IsTransactionCodeValid(args.Value); }
This approach works, however it pollutes the code-behind with many very similar methods. The problem is that the validation rules in most cases are not property of page but one of data model. That's why page validation methods just forward check to somewhere.
While thinking on how to simplify the code we have came up with more conscious and short way to express validators, namely using lambda functions. To that end we have introduced a Validator property of type ValueValidator over DataBindExtender. Where
/// <summary>A delegate to validate values.</summary> /// <param name="extender">An extender instance.</param> /// <param name="value">A value to validate.</param> /// <returns>true for valid value, and false otherwise.</returns> public delegate bool ValueValidator(DataBindExtender extender, object value);
/// <summary>An optional data member validator.</summary> public virtual ValueValidator Validator { get; set; }
With this new property the page markup looks like this:
<input type="text" runat="server" ID="id1222146409" maxlength="4"/> <bphx:DataBindExtender runat="server" TargetControlID="id1222146409" ControlProperty="Value" DataSource="<%# Import.AaControlAttributes %>" DataMember="UserEnteredTrancode" Validator='<%# (extender, value) => Functions.IsTransactionCodeValid(value as string) %>' ErrorMessage="Invalid transaction code."/>
This is almost like an event handler, however it allowed us to call data model validation logic without unnecessary code-behind.
The updated DataBindExtender can be found at DataBindExtender.cs.
Being well behind of the latest news and traps of the ASP.NET, we're readily falling on each problem. :-)
This time it's a script injection during data binding.
In JSF there is a component to output data called h:outputText. Its use is like this:
<span jsfc="h:outputText" value="#{myBean.myProperty}"/>
The output is a span element with data bound value embeded into content. The natural alternative in ASP.NET seems to be an asp:Label control:
<asp:Label runat="server" Text="<%# Eval("MyProperty") %>"/>
This almost works except that the h:outputText escapes data (you may override this and specify attribute escape="false"), and asp:Label never escapes the data.
This looks as a very serious omission in ASP.NET (in fact very close to a security hole). What are chances that when you're creating a new page, which uses data binding, you will not forget to fix code that wizard created for you and to change it to:
<asp:Label runat="server" Text="<%# Server.HtmlEncode(Eval("MyProperty")) %>"/>
Eh? Think what will happen if MyProperty will return a text that looks like a script (e.g.: <script>alert(1)</script>), while you just wanted to output a label?
To address the issue we've also introduced a property Escape into DataBindExtender. So at present we have a code like this:
<asp:Label runat="server" ID="MyLabel"/> <bphx:DataBindExtender runat="server" TargetControlID="MyLabel" ControlProperty="Text" ReadOnly="true" Escape="true" DataSource="<%# MyBean %>" DataMember="MyProperty"/>
See also: A DataBindExtender, Experience of JSF to ASP.NET migration
After struggling with ASP.NET data binding we found no other way but to introduce our little extender control to address the issue.
We were trying to be minimalistic and to introduce two way data binding and to support data conversion. This way extender control (called DataBindExtender) have following page syntax:
<asp:TextBox id=TextBox1 runat="server"></asp:TextBox> <cc1:DataBindExtender runat="server" DataSource="<%# Data %>" DataMember="ID" TargetControlID="TextBox1" ControlProperty="Text" />
Two way data binding is provided with DataSource object (notice data binding over this property) and a DataMember property from the one side, and TargetControlID and ControlProperty from the other side. DataBindExtender supports Converter property of type TypeConverter to support custom converters.
DataBindExtender is based on AjaxControlToolkit.ExtenderControlBase class and implements System.Web.UI.IValidator. ExtenderControlBase makes implementation of extenders extremely easy, while IValidator plugs natuarally into page validation (Validate method, Validators collections, ValidationSummary control).
The good point about extenders is that they are not visible in designer, while it exposes properties in extended control itself. The disadvantage is that it requires Ajax Control Toolkit, and also ScriptManager component of the page.
To simplify the use DataBindExtender gets data from control and puts the value into data source in Validate method, and puts data into control in OnPreRender method; thus no specific action is required to perform data binding.
Source for the DataBindExtender is DataBindExtender.cs.
We used to think that ASP.NET is a way too powerful than JSF. It might be still true, but not when you are accustomed to JSF and spoiled with its code practice...
Looking at both technologies from a greater distance, we now realize that they give almost the same level of comfort during development, but they are different. You can feel this after you were working for some time with one technology and now are to implement similar solution in opposite one. That is where we have found ourselves at present.
The funny thing is that we did expect some problems but in a different place. Indeed, both ASP.NET and JSF are means to define a page layout and to map input and output of business data. While with the presentation (controls, their compositions, masters, styles and so on) you can find more or less equal analogies, the differences of implementation of data binding is a kind of a pain.
We have found that data binding in ASP.NET is somewhat awkward. Its Eval and Bind is bearable in simple cases but almost unusable when you business data is less trivial, or if you have to apply custom data formatting.
In JSF, with its Expression Language, we can perform two way data binding for rather complex properties like ${data.items[index + 5].property}, or to create property adapters ${my:asSomething(data.bean, "property").Value}, or add standard or custom property converters. In contrast data binding in ASP.NET is limited to simple property path (no expressions are supported), neither custom formatters are supported (try to format number as a telephone number).
Things work well when you're designing ASP.NET application from scratch, as you naturally avoid pitfalls, however when you got existing business logic and need to expose it to the web, you have no other way but to write a lot of code behind just to smooth out the problems that ASP.NET exhibits.
Another solution would be to design something like extender control that would attach more proper data binding and formatting facilities to control properties. That would allow to make page definitions in more declarative way, like what we have now in JSF.
While porting a solution from JSF to ASP.NET we have seen an issue with synchronization of access to a data stored in a session from multiple requests.
Consider a case when you store a business object in a session.
Going through the request lifecycle we observe that this business object may be accessed at different stages: data binding, postback event handler, security filters, other.
Usually this business object is mutable and does not assume concurent access. Browsers, however, may easily issue multiple requests to the same session at the same time. In fact, such behaviour, is not even an exception, as browsers nowadays are often sending concurrent requests.
In the JSF we're using a sync object, which is part of business object itself; lock it and unlock at the begin and at the end of a request correspondingly. This works perfectly as JSF guarantees that:
- lock is released after it's acquired (we use request scope bean with
@PostConstruct and @PreDestroy annotations to lock and unlock);
- both lock and unlock take place in the same thread.
ASP.NET, in contrast, tries to be more asynchronous, and allows for different stages of request to take place in different threads. This could be seen indirectly in the documentation, which does not give any commitments in this regards, and with code inspection where you can see that request can begin in one thread, and a next stage can be queued for the execution into the other thread.
In addition, ASP.NET does not guarantee that if BeginRequest has been executed then EndRequest will also run.
The conclusion is that we should not use locks to synchronize access to the same session object, but rather try to invent other means to avoid data races.
Update msdn states:
Concurrent Requests and Session State
Access to ASP.NET session state is exclusive per session, which means that if two different users make concurrent requests, access to each separate session is granted concurrently. However, if two concurrent requests are made for the same session (by using the same SessionID value), the first request gets exclusive access to the session information. The second request executes only after the first request is finished. (The second session can also get access if the exclusive lock on the information is freed because the first request exceeds the lock time-out.)
This means that the required synchronization is already built into ASP.NET. That's good.
We have implemented report parser in C#. Bacause things are spinned around C#, a
schema definition is changed.
We have started from classes defining a report definition tree, annotated these
classes for xml serialization, and, finally, produced xml schema for such tree.
So, at present, it is not an xml schema with annotations but a separate xml
schema.
In addition we have defined APIs:
- to enumerate report data (having report definition and report data one can get
IEnumerable<ViewValue> to iterate report data in structured form);
- to read report through
XmlReader, which allows, for example, to
have report as input for an xslt tranformation.
- to write report directly into
XmlWriter.
An example of report definition as C# code is:
MyReport.cs. The very same report definition but serialized into xml is
my-report.xml. A generated xml schema for a report definition is:
schema0.xsd.
The good point about this solution is that it's already flexible enough to
describe every report layout we have at hands, and it's extendable. Our
measurments show that report parsing is extremely fast and have very small
memory footprint due to forward only nature of report definitions.
From the design point of view report definition is a view of original text data
with view info attached.
At present we have defined following views:
- Element - a named view to generate output from a content view;
- Content - a view to aggregate other views together;
- Choice - a view to produce output from one of content views;
- Sequence - a view to sequence input view by key expressions, and to attach an
index to each sequence item;
- Iterator - a view to generate output from input view while some condition is
true, and to attach an iteration index to each part of output view;
- Page - a view to remove page headers and footers in the input view, and to
attach an index to each page;
- Compute - a named view to produce result of evaluation of expression as output
view;
- Data - a named view to produce output value from some bounds of input view,
and optionally to convert, validate and format the value.
To specify details of definitions there are:
- expressions to deal with integers:
Add, Div,
Integer, MatchProperty, Max, Min,
Mod, Mul, Neg, Null,
Sub, VariableRef, ViewProperty, Case;
- conditions to deal with booleans:
And, EQ, GE,
GT, IsMatch, LE, LT,
NE, Not, Or.
At present there is no a specification of a report definitions. Probably, it's
the most complex part to create such a spec for a user without deep knowledge.
At present, our idea is that one should use xml schema (we should polish
generated schema) for the report definition and schema aware editor to build
report definitions. That's very robust approach working perfectly with
languages xom.
C# sources can be found at:
ReportLayout.zip including report definition classes and a sample report.
We're facing a task of parsing reports produced from legacy applications and
converting them into a structured form, e.g. into xml. These xml files can be
processed further with up to date tools to produce good looking reports.
Reports at hands are of very different structure and of size: from a couple of KB
to a several GB. The good part is that they mostly have a tabular form, so it's
easy to think of specific parsers in case of each report type.
Our goal is to create an environment where a less qualified person(s) could
create and manage such parsers, and only rarely to engage someone who will handle
less untrivial cases.
Our analysis has shown that it's possible to write such parser in almost any
language: xslt, C#, java.
Our approach was to create an xml schema annotations
that from one side define a data structure, and from the other map report
layout. Then we're able to create an xslt that will generate either xslt, C#, or
java parser according to the schema definitions. Because of
languages
xom, providing XML Object Model and serialization stylesheets for C# and
java, it does not really matter what we shall generate xslt or C#/java, as
code will look the same.
The approach we're going to use to describe reports is
not as powerfull as conventional parsers. Its virtue, however, is simplicity of
specification.
Consider a report sample (a data to extract is in bold):
1 TITLE ... PAGE: 1
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
XXX 065 065 CLR ...
YYY ...
...
1 TITLE ... PAGE: 2
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
AAA NNN MMM PPP ...
BBB ...
...
* * * * * E N D O F R E P O R T * * * * *
We're approaching to the report through a sequence of views (filters) of this
report. Each veiw localizes some report data either for the subsequent
filterring or for the extraction of final data.
Looking into the example one can build following views of the report:
- View of data before the "E N D O F R E P O R T" line.
- View of remaining data without page headers and footers.
- Views of table rows.
- Views of cells.
A sequence of filters allows us to build a pipeline of transformations of
original text. This also allows us to generate a clean xslt, C# or java code
to parse the data.
At first, our favorite language for such parser was xslt.
Unfortunatelly, we're dealing with Saxon xslt implementation, which is not very
strong in streaming processing. Without a couple of extension functions to
prevent caching, it tends to cache whole input in the memory, which is not
acceptable.
At present we have decided to start from C# code, which is pure C# naturally.
:-)
Code still is in the development but at present we would like to share the xml
schema annotations describing report layout:
report-mapping.xsd, and a sample of report description:
test.xsd.
To see that the problem with Generator functions in xslt
is a bit more complicated compare two functions.
The first one is quoted from the earlier post:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It does not work in Saxon: crashes with out of memory.
The second one is slightly modified version of the same function:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value + 0"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It's working without problems. In first case Saxon decides to cache all
function's output, in the second case it decides to evaluate data lazily on
demand.
It seems that optimization algorithms implemented in Saxon are so plentiful and
complex that at times they fool one another. :-)
See also:
Generator functions
There are some complications with streamed tree that we have implemented in
saxon. They are due to the fact that only a view of input data is available at
any time. Whenever you access some element that's is not available you're
getting an exception.
Consider an example. We have a log created with java logging. It looks like
this:
<log>
<record>
<date>...</date>
<millis>...</millis>
<sequence>...</sequence>
<logger>...</logger>
<level>INFO</level>
<class>...</class>
<method>...</method>
<thread>...</thread>
<message>...</message>
</record>
<record>
...
</record>
...
We would like to write an xslt that returns a page
of log as html:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs t">
<xsl:param name="start-page" as="xs:integer" select="1"/>
<xsl:param name="page-size" as="xs:integer" select="50"/>
<xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<xsl:variable name="records" as="element()*"
select="subsequence(record, $start, $page-size)"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:apply-templates mode="t:record" select="$records"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:template mode="t:record" match="record">
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
This code does not work. Guess why? Yes, it's subsequence(), which is too greedy.
It always wants to know what's the next node, so it naturally skips a content of
the current node. Algorithmically, such saxon code could be rewritten, and could possibly work
better also in modes other than streaming.
A viable workaround, which does not use subsequence, looks rather untrivial:
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<xsl:variable name="end" as="xs:integer"
select="$start + $page-size"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:sequence select="
t:generate-records(record, $start, $end, ())"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:function name="t:generate-records" as="element()*">
<xsl:param name="records" as="element()*"/>
<xsl:param name="start" as="xs:integer"/>
<xsl:param name="end" as="xs:integer?"/>
<xsl:param name="result" as="element()*"/>
<xsl:variable name="record" as="element()?" select="$records[$start]"/>
<xsl:choose>
<xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
<xsl:sequence select="$result"/>
</xsl:when>
<xsl:otherwise>
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="$record"/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<xsl:variable name="next-result" as="element()*">
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:variable>
<xsl:sequence select="
t:generate-records
(
$records,
$start + 1,
$end,
($result, $next-result)
)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
Here we observed the greediness of saxon, which too early tried to consume
more input than it's required. In the other cases we have seen that it may
defer actual data access to the point when there is no data anymore.
So, without tuning internal saxon logic it's possible but not easy to write
stylesheets that exploit streaming features.
P.S. Updated sources are at
streamedtree.zip
At some point we needed to have an array with volatile elements in java.
We knew that such beast is not found in the java world. So we searched
the Internet and found the answers that are so wrong, and introduce so obscure
threading bugs that the guys who provided them would better hide them and run immediately to fix their
buggy programs...
The first one is
Volatile arrays
in Java. They suggest such solution:
volatile int[] arr = new int[...];
...
arr[4] = 100;
arr = arr;
The number two:
What Volatile Means in Java
A guy assures that this code works:
Fields:
int answer = 0;
volatile boolean ready = false;
Thread1:
answer = 42;
ready = true;
Thread2:
if (ready)
{
print(answer);
}
They are very wrong! Non volatile access can be reordered by the implementation.
See Java's
Threads and Locks:
The rules for volatile variables effectively require that main memory be touched exactly once for each use or assign of a volatile variable by a thread, and that main memory be touched in exactly the order dictated by the thread execution semantics. However, such memory actions are not ordered with respect to read and write actions on nonvolatile variables.
They probably thought of locks when they argued about volatiles:
a lock action acts as if it flushes all variables from the
thread's working memory; before use they must be assigned or loaded from main
memory.
P.S. They would better recommend
AtomicReferenceArray.
When time has come to process big xml log files we've decided to implement
streamable tree in saxon the very same way it was implemented in .net eight
years ago (see
How would we approach to streaming facility in xslt).
It's interesting enough that the implementation is similar to one of
composable tree. There a node never stores a reference to a parent, while in
the streamed tree no references to children are stored. This way only a limited
subview of tree is available at any time. Implementation does not support
preceding and preceding-sibling axes. Also, one cannot navigate to a node that
is out of scope.
Implementation is external (there are no changes to saxon itself). To use it one
needs to create an instance of DocumentInfo, which pulls data from
XMLStreamReader, and
to pass it as an input to a transformation:
Controller controller =
(Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();
bridge.setPipelineConfiguration(
controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);
inputSource = new DocumentImpl(bridge);
transformer.transform(inputSource, new StreamResult(output));
This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs">
<xsl:template match="/log">
<html>
<head>
<title>Log</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match="message">
...
</xsl:template>
<xsl:template match="message[@error]">
...
</xsl:template>
...
</xsl:stylesheet>
Implementation can be found at:
streamedtree.zip
By the generator we assume a function that produces an infinitive output
sequence for a particular input.
That's a rather theoretical question, as xslt does not allow infinitive
sequence, but look at the example:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:variable name="value" as="xs:string" select="'10100101'"/>
<xsl:variable name="values" as="xs:integer+"
select="t:generate(1)"/>
<!--<xsl:variable name="values" as="xs:integer+">
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="1"/>
</xsl:call-template>
</xsl:variable>-->
<xsl:variable name="integer" as="xs:integer" select="
sum
(
for $index in 1 to string-length($value)
return
$values[$index][substring($value, $index, 1) = '1']
)"/>
<xsl:message select="$integer"/>
</xsl:template>
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
<!--<xsl:template name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="$value * 2"/>
</xsl:call-template>
</xsl:template>-->
</xsl:stylesheet>
Here the logic uses such a generator and decides by itself where to break.
Should such code be valid?
From the algorithmic perspective example would better to work, as separation of
generator logic and its use are two different things.
Lately, after playing a little with saxon tree models, we thought that design
would be more cleaner and implementation faster if NamePool were implemented
differently.
Now, saxon is very pessimistic about java objects thus it prefers to encode
qualified names with integers. The encoding and decoding is done in the
NamePool. Other parts of code use these integer values.
Operations done over these integers are:
- equality comparision of two such integers in order to check whether to
qualified or extended names are equal;
- get different parts of qualified name from
NamePool.
We would design this differently. We would:
- create a
QualifiedName class to store all name parts.
- declare
NamePool to create and cache QualifiedName instances.
This way:
- equality comparision would be a reference comparision of two instances;
- different parts of qualified name would become a trivial getter;
- contention of such name pool would be lower.
That's the implementation we would propose:
QualifiedName.java,
NameCache.java
Earlier, in the entry "Inline functions in xslt 2.1" we've described an implementation of xml tree model that may share subtrees among different trees.
This way, in a code:
<xsl:variable name="elements" as="element()*" select="..."/>
<xsl:variable name="result" as="element()">
<result>
<xsl:sequence select="$elements"/>
</result>
</xsl:variable>
the implementation shares internal representation among $elements
and subtree of $result. From the perspective of xslt it
looks as completely different subtrees with different node identities, which is
in the accordance with its view of the world.
After a short study we've decided to create a research implementation of this
tree model in saxon. It's took only a couple of days to introduce a minimal changes
to engine, to refactor linked tree into a new composable tree, and to
perform some tests.
In many cases saxon has benefited immediately from this new tree model, in some
other cases more tunings are required.
Our tests've showed that this new tree performed better than
linked tree, but a little bit worser than tiny tree. On the other hand, it's
obvious that conventional code patterns avoid subtree copying, assuming it's
expensive operation, thus one should rethink some code practices to benefit from
composable tree.
Implementation can be downloaded at:
saxon.composabletree.zip
From the web we know that xslt WG is thinking now on how to make xslt more friendly to a huge documents. They will probably introduce some xslt syntax to allow implementation to be ready for a such processing.
They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to a some its subset.
The funny part is that we have implemented similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.
Implementation stored only several nodes at a time (context node and its ancestors), and read data from XmlReader perforce. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was already not available we threw an exception.
It was simple, not perfect (too restrictive) but it was pluggable in .net's xslt, and allowed to process files of any size.
That old implementation looks very attractive even now in 2010. We expect that WG with their decisions will not rule out such or similar solutions, and will not force implementers to write alternative engine for xslt streaming.
Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy
baggage.
If you're not entirely concentrated during translation of your algorithms into
xslt 2.0 you can get into trap, as we did.
Consider a code snapshot:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+"
select="$elements/t:convert(.)"/>
Looks simple, isn't it?
Our intention was to get converted elements, which result from some
xsl:apply-templates logic.
Well, this code works... but rather sporadically, as results are often in wrong
order! This bug is very close to
what is called a
Heisenbug.
So, where is the problem?
Elementary, my dear Watson:
xsl:apply-templates constructs a sequence of rootless
elements.
$elements/t:convert(.) converts elements and orders them in document order.
Here is a tricky part:
The relative order of nodes in distinct trees is stable but
implementation-dependent...
Clearly each rootless element belongs to a unique tree.
After that we have realized what the problem is, code has been
immediately rewritten as:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+" select="
for $element in $elements return
t:convert($element)"/>
P.S. Taking into an accout a size of our xslt code base, it took a half an hour to
localize the problem. Now, we're at position to review all uses of slashes
in xslt. As you like it?
Inline functions in xslt 2.1 look often as a some strange aberration. Sure,
there are very usefull cases when they are delegates of program logic
(e.g. comparators, and filters), but often (probably more often) we can see that
it's use is to model data structures.
As an example, suppose you want to model a structure with three properties say
a, b, and c. You implement this creating functions that wrap and unwrap the
data:
function make-data($a as item(), $b as item(), $c
as item()) as function() as item()+
{
function() { $a, $b, $c }
}
function a($data as function() as item()+) as item()
{
$data()[1]
}
function b($data as function() as item()+) as item()
{
$data()[2]
}
function c($data as function() as item()+) as item()
{
$data()[3]
}
Clever?
Sure, it is! Here, we have modeled structrue with the help of sequence, which we
have wrapped into a function item.
Alas, clever is not always good (often it's a sign of a bad). We just wanted to
define a simple structure. What it has in common with function?
There is a distance between what we want to express, designing an algorithm, and
what we see looking at the code. The greater the distance, the more efforts are
required to write, and to read the code.
It would be so good to have simpler way to express such concept as a structure.
Let's dream a little. Suppose you already have a structure, and just want to
access its members. An idea we can think about is an xpath like access method:
$data/a, $data/b, $data/c
But wait a second, doesn't
$data looks very like an xml element, and its accessors are just
node tests?
That's correct, so data constructor may coincide with element constructor.
Then what pros and cons of using of xml elements to model structures?
Pros are: existing xml type system, and sensibly looking code (you just
understand that here we're constructing a structure).
Cons are: xml trees are implemented the way that does not assume fast (from the
perfromace perspective) composition, as when you construct a structure a copy of
data is made.
But here we observe that "implemented" is very important word in this context.
If xml tree implementation would not store reference to the parent node then
subtrees could be composed very efficiently (note that tree is assumed to be
immutable). Parent node could be available through a tree navigator, which would
contain reference to a node itself and to a parent tree navigator (or to store child parent map somewhere near the root).
Such tree structure would probably help not only in this particular case but
also with other conventional xslt code patterns.
P.S. Saxon probably could implement its NodeInfo this way.
Update: see also Custom tree model.
A while ago we have proposed to introduce maps as built-in types in
xpath/xquery type system: Tuples and maps.
The suggestion has been declined (probably our arguments were not convincing). We,
however, still think that maps along with sequences are primitives, required to
build sensible (not obscure one) algorithms. To see that map is at times is the
best way to resolve the problems we shall refer to an utility function to
allocate names in scope. Its signature looks like this:
<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
<xsl:param name="prefixes" as="xs:string*"/>
<xsl:param name="names" as="xs:string*"/>
<xsl:param name="name-max-length" as="xs:integer?"/>
Just try to program it and you will find yourselves coding something like one defined at
cobolxom.
To be efficient such maps should provide, at worst, a logarithmic operation
complexity:
- Access to the map through a key (and/or by index) - complexity is LnN;
- Creation of a new map with added or removed item - complexity is LnN;
- Construction of the map from ordered items - complexity is O(N);
- Construction of the map from unordered items - complexity is O(N*LnN);
- Total enumeration of all items - complexity is O(N*LnN);
These performance metrics are found in many functional and procedural
implementations of the maps. Typical RB and AVL tree based maps satisfy these
restrictions.
What we suggest is to introduce map implementation into the exslt2 (assuming
inline functions are present). As a sketch we have implemented pure
AVL Tree in Xslt 2.0:
We do not assume that implementation like this should be used, but rather think
of some extension function(s) that provides a better performance.
What do you think?
The story about immutable tree would not be complete without xslt
implementation. To make it possible one needs something to approxomate tree
nodes. You cannot implement such consruct efficiently in pure xslt 2.0 (it would
be either unefficient or not pure).
To isolate the problem we have used tuple interface:
tuple:ref($items as item()*) as item() - to wrap items into a tuple;
tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
tuple:is-same($first as item(), $second as item()) as xs:boolean
- to test
whether two tuples are the same.
and defined inefficient implementation based on xml elements. Every other part
of code is a regular AVL algorithm implementation.
We want to stress again that even assuming that there is a good tuple
implementation we would prefer built-in associative container implementation.
Why the heck you need to include about 1000 lines of code just to use a map?
Source code is:
We like Visual Studio very much, and try to adopt new version earlier.
For the last time our VS's use pattern is centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen even with lack of support of xslt 2.0 debugging.
Unfortunatelly, that is still a true when VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.
We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.
Why do we return to this theme again?
Well, it's itching!
In cobolxom there is an utility function to allocate names in scope. Its signature looks like this:
<!-- Allocates unique names in the form $prefix{number}?. Note that prefixes may coincide. $prefixes - a name prefixes. $names - allocated names pool. $name-max-length - a longest allowable name length. Returns unique names. --> <xsl:function name="t:allocate-names" as="xs:string*"> <xsl:param name="prefixes" as="xs:string*"/> <xsl:param name="names" as="xs:string*"/> <xsl:param name="name-max-length" as="xs:integer?"/>
We have created several different implementations (all use recursion). Every implementation works fair for relatively small input sequences, say N < 100, but we have cases when there are about 10000 items on input. Algorithm's worst case complexity, in absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to xslt engine implementation.
If there were associative containers with efficient access (complexity is O(LogN)), and construction of updated container (complexity is also O(LogN)) then implementation would be straightforward and had complexity O(N*LogN).
Given:
public class N
{
public readonly N next;
}
What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ... nk to n1? Is it possible?
To end with immutable trees, at least for now, we've implemented IDictionary<K, V>.
It's named Map<K, V>. Functionally it looks very like SortedDictionary<K, V>.
there are some differences, however:
Map in contrast to SortedDictionary is very cheap on
copy.
- Bacause
Map is based on AVL tree, which is more rigorly balanced
than RB tree, so it's a little bit faster asymptotically for lookup than SortedDictionary,
and a little bit slower on modification.
- Due to the storage structure: node + navigator,
Map consumes less memory than
SortedDictionary, and is probably cheaper for GC (simple garbage
graphs).
- As AVL tree stores left and right subtree sizes, in contrast to a "color" in
RB tree, we able to index data in two ways: with integer index, and with key
value.
Sources are:
Update:
It was impossible to withstand temptation to commit some primitive performance
comparision. Map outperforms SortedDictionary both in population and in access.
this does not aggree with pure algorithm's theory, but there might be other
unaccounted factors: memory consumption, quality of implementation, and so on.
Program.cs is updated with measurements.
Update 2:
More occurate tests show that for some key types Map's faster, for others
SortedDictionary's faster. Usually Map's slower during population (mutable AVL
tree navigator may fix this). the odd thing is that Map<string, int> is faster
than SortedDictionary<string, int> both for allocaction and for access. See
excel report.
Update 3:
Interesing observation. The following table shows maximal and
average tree heights for different node sizes in AVL and RB trees after a random population:
|
AVL |
RB |
| Size |
Max |
Avg |
Max |
Avg |
| 10 |
4 |
2.90 |
5 |
3.00 |
| 50 |
7 |
4.94 |
8 |
4.94 |
| 100 |
8 |
5.84 |
9 |
5.86 |
| 500 |
11 |
8.14 |
14 |
8.39 |
| 1000 |
12 |
9.14 |
16 |
9.38 |
| 5000 |
15 |
11.51 |
18 |
11.47 |
| 10000 |
16 |
12.53 |
20 |
12.47 |
| 50000 |
19 |
14.89 |
23 |
14.72 |
| 100000 |
20 |
15.90 |
25 |
15.72 |
| 500000 |
25 |
18.26 |
28 |
18.27 |
| 1000000 |
25 |
19.28 |
30 |
19.27 |
Here, according with theory, the height of AVL tree is shorter than the height
of RB tree. But what is most interesting is that the depth of an "average
node". This value describes a number of steps required to find a random key. RB
tree is very close and often is better than AVL in this regard.
It was obvious as hell from day one of generics that there will appear obscure
long names when you will start to parametrize your types. It was the easiest
thing in the world to take care of this in advanvce. Alas, C# inherits C++'s bad
practices.
Read Associative containers in a functional languages
and
Program.cs to see what we're talking about.
Briefly, there is a pair (string, int), which in C# should be declared as:
System.Collections.Generic.KeyValuePair<string, int>
Obviously we would like to write it in a short way. These are our attempts, which
fail:
1. Introduce generic alias Pair<K, V>:
using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;
2. Introduce type alias for a generic type with specific types.
using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;
And this is only one that works:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
Do you think is it bearable? Well, consider the following:
- There is a generic type
ValueNode<T>, where T
should be Pair.
- There is a generic type
TreeNavigator<N>, where N is should be ValueNode<Pair>.
The declaration looks like this:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>>;
Do you still think is it acceptable?
P.S. Legacy thinking led C#'s and java's designers to the use of word "new" for the
object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1).
C++ prefers second form. C# and java always use the first one.
Continuing with the post "Ongoing xslt/xquery spec update"
we would like to articulate what options regarding associative containers do we
have in a functional languages (e.g. xslt, xquery), assuming that variables are
immutable and implementation is efficient (in some sense).
There are three common implementation techniques:
- store data (keys, value pairs) in sorted array, and use binary search to
access values by a key;
- store data in a hash map;
- store data in a binary tree (usually RB or AVL trees).
Implementation choice considerably depends on operations, which are taken over
the container. Usually these are:
- construction;
- value lookup by key;
- key enumeration (ordered or not);
- container modification (add and remove data into the
container);
- access elements by index;
Note that modification in a functional programming means a creation of a new
container, so here is a
division:
- If container's use pattern does not include modification, then probably the
simplest solution is to build it as an ordered sequence of
pairs, and use binary search to access the data. Alternatively, one could
implement associative container as a hash map.
- If modification is essential then neither ordered sequence of pairs, hash map
nor classical tree implementation can be used, as they are either too slow
or too greedy for a memory, either during modification or during access.
On the other hand to deal with container's modifications one can build
an implementation, which uses "top-down" RB
or AVL trees. To see the
difference consider a classical tree structure and its functional variant:
|
Classical |
Functional |
| Node structure: |
node
parent
left
right
other data |
node
left
right
other data |
| Node reference: |
node itself |
node path from a root of a tree |
| Modification: |
either mutable or requires a completely new tree |
O(LnN) nodes are created
|
Here we observe that:
- one can implement efficient map (lookup time no worse than O(LnN)) with no
modification support, using ordered array;
- one can implement efficient map with support of modification, using immutable binary tree;
- one can implement all these algorithms purely in xslt and xquery (provided that inline
functions are supported);
- any such imlementation will lose against the same implementation
written in C++, C#, java;
- the best implementation would probably start from sorted array and
will switch to binary tree after some size threshold.
Here we provide a C# implementation of a functional AVL tree, which also supports
element indexing:
Our intention was to show that the usual algorithms for associative
containers apply in functional
programming; thus a feature complete functional language must support
associative containers to make development more conscious, and to free a
developer from inventing basic things existing already for almost a half of
century.
Several years ago we have started a new project. We do not like neither hate
any particular language, thus the decision what language to use was pragmatical:
xslt 2.0 fitted perfectly.
At present it's a solid volume of xslt code. It exhibits all the virtues of any
other good project in other language: clean design, modularity, documentation,
sophisticationless (good code should not be clever).
Runtime profile of the project is that it deals with xml documents with sizes
from a few dozens of bytes to several megabytes, and with xml schemas from
simple ones like a tabular data, and to rich like xhtml and untyped. Pipeline
of stylesheets processes gigabytes of data stored in the database and in files.
All the bragging above is needed here to introduce the context for the following
couple of lines.
The diversity of load conditions and a big code base, exposed xslt engine of
choice to a good proof test. The victim is Saxon. In the course of project we
have found and reported many bugs. Some of them are nasty and important, and
others are bearable. To Michael Kay's credit (he's owner of Saxon) all bugs are being
fixed promtly (see
the last one).
Such project helps us to understand a weak sides of xslt (it seems sometimes they, in WG, lack such experience, which should lead them through).
Well, it has happened so that we're helping to Saxon project. Unintentionally,
however! :-)
P.S.
About language preferences.
Nowdays we're polishing a
COBOL
generation. To this end we have familiarized ourselves with this language.
That's the beatiful language. Its straightforwardness helps to see the evolution
of computer languages and to understand what and why today's languages try to
timidly hide.
In spite of the fact that our last projects are being developed in Java, the .NET is definitly our favorite platform.
In a twitter I saw the phrase: "Java the language is a stagnant mess". It's said in favour of C#. It's true that C# significantly affects now even on Java (let's remember generics, jaxb, web services, etc.), but in my opinion, the C# won't be the leading language for worldwide enterprise applications in the nearest future.
One of causes is that the main platform for .NET still is Windows. The situation could be changed by Mono project, but I think there are yet not enough projects on platforms other than Windows.
My guess is confirmed by some of observations that I did as a software engineer of an IT company. Our company performs different software porting projects from legacy programming languages like COBOL, ADSO, Natural etc. into up to date languages like Java, C# etc. It worth to say that clients rarely select to migrate to .NET despite to our advices.
The main reason of such choice, according to most of our clients, is that they want to be platform independent and only Java gives them this choice.
It worth for Microsoft to think about cooperation with Mono in order to make .NET really platform indpendent, otherwise C# will always be a step behind Java despite apparent advantages of C# as a programming language.
Suppose you have a library, which is out in the production and is used by many clients. At the same time the library evolves: API is extended, bugs are being fixed, code becomes faster and cleaner, bla bla bla...
At some point you're fixing some important bug that's been hiding for a long time in the bowels of your library. You're happy that you've spotted it before clients got into troubles. You're notifying all the clients that there is an important fix, and that they need to update the library.
What do you think you hear in return?
Well, we're not perfect, there are bugs in our software. We and our clients realize this. Nothing will eliminate bugs to creep into a code from time to time.
That's a train of thought of a particular client:
We agree that there is a bug and that it has to be fixed. We, however, want to touch the library in a minimal way, as who knows what other new bugs had they introduced, so let's ask them to fix this particular bug in our version of the library.
That's fair from the client's perspective. They don't want better code, they just want that particular bug fixed!
For us, however, this means branching some old version of the library, fixing bug and supporting this branch for the particular client. It's fair to expect similar position from each client, thus should we create and support library branches per client, and branch a main branch for a new client only?
For us (Arthur and Vladimir) it looks as enormous waste of resources. We (our company) should either hire more and more scaled people or experience gradual slowdown of support and development.
Our answer could be obvious if not position of top managers who value client relations so much that they easily promise whatever client wishes. No arguments that latest version is better tested, more conforming to specifications, more reliable, faster and so on are accepted. The main argument against our position is that the client's applications run in the production, and no new potential bugs are acceptable.
Here is our dilemma: we neither can convince the client (more precisely our managers) that we're right, nor are convinced with their arguments...
From time to time I like to tell, to express (or even to brag about) what keeps our minds busy. Sometimes it's even useful, and you unexpectedly discover, you are on the edge of community interests, other time you are outsider.
So, what are we doing now? Well, already several years we are working in Multiconn. One of goals of our company is to help to expose mainframe functionality. Many different solutions of this task are available, and even Multiconn provides several different approaches. Now go right to our business - web service as delegate of mainframe.
To be honest, I think mainframes are legacy monsters, dinosaurs. I realize, it's only a bad perspective, but the first experience is often very strong. For us as developers mainframe is a source of data in different formats (sure formats are legacy also): COBOL records, terminal messages, and so on. Our idea is to allow to mainframe to rest in peace and to gracefully consume its data. Combining consumed data in xml form and operation bindings that describe mainframe's application flow, we arrive to a view of mainframe application as a web service.
To achieve this goal we have worked up extension to xml schema that allows mapping schema elements to a data. It should be pointed, it's perfectly legal to extend xml schema. One way to extend it, is to use elements in custom namespace in appInfo element.
The following was to create importer of schema with our extensions. It's somehow similar to xsd.exe tool. Our tool generates classes with annotations XmlXxxAttribute for xml serialization and LayoutXxxAttribute (this is our custom attributes) for instance serializing and deserializing to and from a mainframe data.
The next stage was to create data serializer. This is a counterpart to XmlSerializer. It inspects class meta-data and creates plan for serialization and deserialization.
On the next stage we have worked up wsdl schema bindings for our technology.
After that we have created tool to import wsdl (similar to wsdl.exe) that generates web service which passes input messages to communication layer that serializes and forwards data to mainframe, accepts and deserializes response and returns it to the web service, which in its turn returns result to a client.
The next was communication layer that interacts with mainframe. This layer consists of abstract (general) sublayer and specializations which support different patterns of communications.
As result we have web service implemented in .NET. This web service represents some mainframe's application.
There are plans to generate similar web services in the Java.
|