We heavily use kendo.ui.Splitter widget. Unfortunately it has several drawbacks:
- you cannot easily configure panes declaratively;
- you cannot define a pane that takes space according to its content.
Although we don't like to patch widgets, in this case we found no better
way but to patch two functions: kendo.ui.Splitter.fn._initPanes,
and kendo.ui.Splitter.fn._resize.
After the fix, splitter markup may look like the following:
<div style="height: 100%"
data-role="splitter"
data-orientation="vertical">
<div data-pane='{ size: "auto", resizable: false, scrollable: false }'>
Header with size depending on content.
</div>
<div data-pane='{ resizable: false, scrollable: true }'>
Body with size equal to a remaining area.
</div>
<div data-pane='{ size: "auto", resizable: false, scrollable: false }'>
Footer with size depending on content.
</div>
</div>
Each pane may define a data-pane attribute with pane parameters. A pane may
specify size = "auto" to take space according to its content.
The code can be found at
splitter.js A test can be seen at
splitter.html.
To simplify KendoUI development we have defined nesterovskyBros.data.Model, which extends kend.data.Model class.
Extensions in nesterovskyBros.data.Model
- As with
kendo.data.Model there is fields Object - a set of key/value pairs to configure the model fields, but fields have some more options:
fields.fieldName.serializable Boolean - indicates whether the field appears in an object returned in model.toJSON(). Default is true.
fields.fieldName.updateDirty Boolean - indicates whether the change of the property should trigger dirty field change. Default is true.
- When model defines a field and there is a prototype function with the same name then this function is used to get and set a field value.
- When property is changed through the
model.set() method then dirty change event is triggered (provided that fields.fieldName.updateDirty !== false). This helps to build a dependcy graph on that property.
- When model instance is consturcted, the data passed in are validated, nullable and default values are set.
Model example
Here is an example of a model:
nesterovskyBros.data.ProductModel = nesterovskyBros.data.Model.define(
{
fields:
{
name: { type: "string", defaultValue: "Product Name" },
price: { type: "number", defaultValue: 10 },
unitsInStockValue: { type: "number", defaultValue: 10, serializable: false },
unitsInStock: { type: "string" }
},
unitsInStock: function(value)
{
if (value === undefined)
{
var count = this.get("unitsInStockValue");
return ["one", "two", "three", "four"][count] || (count + "");
}
else
{
this.set("unitsInStockValue", ({one: 1, two: 2, three: 3, four: 4 })[value] || value);
}
}
});
Notice that:
unitsInStock property is implemented as a function - this helps to map model values to presentation values.
- when you call
model.toJSON(), or JSON.stringify() you will see in result name, price, unitsInStock values only - this helps to get model's state and to store it somewhere (e.g. in sessionStorage).
- in a code:
var model = new nesterovskyBros.data.ProductModel({ price: "7", unitsInStock: "one" });
the following is true:
(typeof(model.price) == "number") && (mode.price == 7) && (model.name == "Product Name") && (model.unitsInStockValue == 1)
As with UserControl the implemntation is defined in the controls.js. The sample page is the same index.html
Two weeks ago we've gotten new Lenovo 13" laptops (Yoga-13 with touch screens and Windows 8 Pro on board).
The first expression was WOW! Touch screens! Windows 8! Now we'll try our hand on that new (for us) API. So new, so cool...
A day later. What a shit this new UI. Where are my desktop, "Start" button, all the programs... After googling we've understood - we're not alone.
Few more days later. We've recognized that our SSD hard disk won't live long life with our projects. We generates output several GB a day. Thus we've decided to buy external SD cards - additional 64Gb, class 10. That's enough for us. No sooner said than done. After several attempts to copy our projects from hard drive to SD card (~9Gb of sources) we strongly believe that such a vigorous mix (Lenovo + Win 8 + external SD card) won't survive. Windows 8 hangs up when display off (in middle of data copy, after an hour of work). What a .... of .... this Windows 8, Lenovo and SD cards all together.
Developing with KendoUI we try to formalize tasks. With this in mind we would like to have user controls.
We define user control as following:
It is a javascript class that extends Widget.
It offers a way to reuse UI.
It allows to define a model and a template with UI and data binding.
Unfortunately, KendoUI does not have such API, though one can easily define it; so we have defined our version.
Here we review our solution. We have taken a grid KendoUI example and converted it into a user control.
User control on the page
See index.html
<!DOCTYPE html>
<html>
<head>
<title>Test</title>
<!-- (1) Include templates for controls. -->
<script src="scripts/templates.js"></script>
<script src="scripts/jquery/jquery.js"></script>
<script src="scripts/kendo/kendo.web.min.js"></script>
<!-- (2) UserControl definition. -->
<script src="scripts/controls.js"></script>
<!-- (3) Confirm dialog user control. -->
<script src="scripts/controls/confirm.js"></script>
<!-- (4) Products user control. -->
<script src="scripts/controls/products.js"></script>
<link href="styles/kendo.common.min.css" rel="stylesheet" />
<link href="styles/kendo.default.min.css" rel="stylesheet" />
<script>
$(function ()
{
// (5) Bind the page.
kendo.bind(
document.body,
// (6) Model as a datasource.
{ source: [new nesterovskyBros.data.ProductsModel] });
});
</script>
</head>
<body>
<!-- (7) User control and its binding. -->
<div data-role="products" data-bind="source: source"></div>
</body>
</html>
That's what
we see here:
- Templates that define layouts. See "How To: Load KendoUI Templates from External Files", and templates.tt.
- Definition of the UserControl widget.
- Confirm dialog user control (we shall mention it later).
- Products user control.
- Data binding that instantiates page controls.
- Model is passed to a user control through the dataSource.
- Use of Products user control. Notice that "data-role" defines control type, "source" refers to the model.
User Control declaration
Declaration consists of a view and a model.
View is html with data binding. See products.tmpl.html
We build our project using Visual Studio, so templates packaging is done with templates.tt. This transformation converts products template into a tag:
<script id="products-template" type="text/x-kendo-template">
thus template can be referred by a utility function: nesterovskyBros.template("products-template").
Model inherits kedo.data.Model. Here how it looks:
// (1) Define a ProducsModel class.
nesterovskyBros.data.ProductsModel = kendo.data.Model.define(
{
// (2) Model properties.
fields:
{
productName: { type: "string", defaultValue: "Product Name" },
productPrice: { type: "number", defaultValue: 10 },
productUnitsInStock: { type: "number", defaultValue: 10 },
products: { type: "default", defaultValue: [] }
},
// (3) Model methods.
addProduct: function ()
{
...
},
deleteProduct: function (e)
{
...
},
...
});
// (4) Register user control.
nesterovskyBros.ui.Products = nesterovskyBros.defineControl(
{
name: "Products",
model: nesterovskyBros.data.ProductsModel
});
That's what we have here:
- We define a model that inherits KendoUI Model.
- We define model fields.
- We define model methods.
- Register user control with
nesterovskyBros.defineControl(proto) call, where:
proto.name - defines user control name;
proto.model - defines model type;
proto.template - defines optional template. If not specified, a template is retrieved from $("#" + proto.name.toLowerCase() + "-template").html().
UserControl API
Now, what's remained is API for the UserControl. See controls.js.
- UserControl defines following events:
change - triggered when data source is changed;
dataBound - triggered when widget is data bound;
dataBinding - triggered befor widget data binding;
save - used to notify user to save model state.
- UserControl defines following options:
autoBind (default false) - autoBind data source;
template (default $.noop) - user control template.
- UserControl defines
dataSource field and setDataSource() method.
- UserControl defines
rebind() method to manually rebuild widget's view from the template and model.
- UserControl sets/deletes model.owner, which is a function returning a user control widget when model is bound/unbound to the widget.
- When UserControl binds/unbinds model a
model.refresh method is called, if any.
- You usually define you control with a call
nesterovskyBros.defineControl(proto). See above.
- There is also a convenience method to build a dialog based on a user control: nesterovskyBros.defineDialog(options), where
options.name - a user control name (used in the data-role);
options.model - a model type;
options.windowOptions - a window options.
This method returns a function that recieves a user control model, and returns a dialog (kendo.ui.Window) based on the user control.
Dialog has model() function that returns an instance of model.
Model has dialog() function that returns an instance of the dialog.
Dialog and model have result() function that returns an instance of deferred object used to track dialog completion.
The example of user control dialog is confirm.js and confirm.tmpl.html.
The use is in the products.js deleteProduct():
deleteProduct: function(e)
{
var that = this;
return nesterovskyBros.dialog.confirm(
{
title: "Please confirm",
message: "Do you want to delete the record?",
confirm: "Yes",
cancel: "No"
}).
open().
center().
result().
then(
function(confirmed)
{
if (!confirmed)
{
return;
}
...
});
}
Last
User controls along with technique to manage and cache templates allow us to build robust web applications. As the added value it's became a trivial task to build SPA.
Our goal is to generate reports in streaming mode.
At some point we need to deal with data streams (e.g. xml streams for xslt
transformations). Often a nature of report demands several passes through the data.
To increase performance we have defined a class named StreamResource.
This class encapsulates input data, reads it once and caches it into a temp
file; thus data can be traversed many times. StreamResource can
read data lazily or in a eager way thus releasing resources early.
This class can be used as a variation of PipeStream, which never blocks, as if
a size of a buffer is not limited, and which can be read many times.
The API
looks like this:
public class StreamResource: IDisposable
{
/// <summary>
/// Creates a StreamSource instance.
/// </summary>
/// <param name="source">
/// A function that returns source as an input stream.
/// </param>
/// <param name="settings">Optional settings.</param>
public StreamResource(Func<Stream> source, Settings settings = null);
/// <summary>
/// Creates a StreamSource instance.
/// </summary>
/// <param name="source">
/// A function that writes source data into an output stream.
/// </param>
/// <param name="settings">Optional settings.</param>
public StreamResource(Action<Stream> source, Settings settings = null);
/// <summary>
/// Gets an input stream.
/// </summary>
/// <param name="shared">
/// Indicates that this StreamResouce should be disposed when returned
/// stream is closed and there are no more currently opened cache streams.
/// </param>
/// <returns>A input stream.</returns>
public Stream GetStream(bool shared = false);
}
The use pattern is following:
// Acquire resource.
using(var resource = new StreamResource(() =>
CallService(params...)))
{
// Read stream.
using(var stream = resource.GetStream())
{
...
}
...
// Read stream again.
using(var stream = resource.GetStream())
{
...
}
}
StreamResource is efficient even if you need to process content only once, as
it monitors timings of reading of source data and compares it with timings of
data consumption. If the difference exceeds some threshold then StreamResource
caches source greedily, otherwise source is pooled lazily. Thus, input resources
can be released promptly. This is important, for example, when the source
depends on a database connection.
The use pattern is following:
// Acquire resource and get shared stream.
using(var stream = new StreamResource(() =>
CallService(params...)).GetStream(true))
{
...
}
Finally, StreamResource allows to process
data in a pipe stream mode. This is when you have a generator function
Action<Stream> that can write to a stream, and you want to read that data.
The advantage of StreamResource over real pipe stream is that it
can work without blocking of generator, thus releasing resources early.
The use pattern is similar to the previous one:
using(var stream = new StreamResource(output =>
Generate(output, params...)).GetStream(true))
{
...
}
The source of the class can be found at
Streaming.zip.
If you deal with
web applications you probably have already dealt with export data to Excel.
There are several options to prepare data for Excel:
- generate CSV;
- generate HTML that excel understands;
- generate XML in Spreadsheet 2003 format;
- generate data using Open XML SDK or some other 3rd party libraries;
- generate data in XLSX format, according to Open XML specification.
You may find a good article with pros and cons of each solution
here. We, in our turn, would like to share our experience in this field. Let's start from requirements:
- Often we have to export huge data-sets.
- We should be able to format, parametrize and to apply different styles to the exported data.
- There are cases when exported data may contain more than one table per sheet or
even more than one sheet.
- Some exported data have to be illustrated with charts.
All these requirements led us to a solution based on XSLT processing of streamed data.
The advantage of this solution is that the result is immediately forwarded to a client as fast as
XSLT starts to generate output. Such approach is much productive than generating of XLSX using of Open XML SDK or any other third party library, since it avoids keeping
a huge data-sets in memory on the server side.
Another advantage - is simple maintenance, as we achieve
clear separation of data and presentation layers. On each request to change formatting or
apply another style to a cell you just have to modify xslt file(s) that generate
variable parts of XLSX.
As result, our clients get XLSX files according with Open XML specifications.
The details of implementations of our solution see in our next posts.
We're implementing UDT changes in the big database. Earlier, that
User Defined Type was based on smallint, and now we have to use int as the base.
The impact
here is manyfold:
- Clients of the database should be prepared to use wider types.
- All stored procedures, functions, triggers, and views should be updated
accordingly.
- Impact on the database size should be analyzed.
- Types of columns in tables should be changed.
- Performance impact should be minimal.
Now, we're trying to address (3),
(5) and to implement (4), while trying to keep interface with clients using old
types.
As for database size impact, we have found that an index fragmentation is a
primary disk space waster (see Reorganize index in SQL Server).
We have performed some partial index reorganization and can see now that we can gain
back hundreds of GB of a disk space. On the other hand we use page compression, so we expect that change of types will not increase
sizes of tables considerably. Indeed, our measurments show that tables will only be
~1-3% bigger.
The change of types of columns is untrivial task. The problem is that if you try
to change column's type (which is part of clustered index) directly then you
should temporary remove foreign keys, and to rebuild all indices. This won't
work neither due to disk space required for the operation (a huge transaction
log is required), nor due to availability of tables (we're talking about days or
even weeks to rebuild indices).
To work-around the problem we have selected another way. For each target table T
we performed the following:
- Renamed table T to T_old;
- Created a table T_new with required type changes;
- Created a view named T, which is union of T_old for the dates before a split
date and T_new for the dates after the split date;
- Created instead of insert/update/delete triggers for the view T.
- Created a procedures that move data in bulks from T_old to the T_new, update
split date in view definitions, and delete data from T_old.
Note that:
- the new view uses wider column types, so we had to change stored
procedures that clients use to cast those columns back to shorter types to
prevent side effects (fortunately all access to this database is through stored
procedures and functions);
- the procedures that transfer data between new and old tables may work online;
- the quality of execution plans did not degrade due to switch from table to a
view;
- all data related to the date after the split date are inserted into T_new
table.
After transfer will be complete we shall drop T_old tables, and T views, and
will rename T_new tables into T.
This will complete part 4 of the whole task. Our estimations are that it will
take a month or even more to complete the transfer. However solution is rather
slow, the database will stay online whole this period, which is required
condition.
The next task is to deal with type changes in parameters of stored procedures
and column types of output result sets. We're not sure yet what's the best way
to deal with it, and probably shall complain about in in next posts.
We have a large table in the form:
create table dbo.Data
(
Date date not null,
Type int not null,
Value nvarchar(50) null,
primary key clustered(Date, Type)
);
create unique nonclustered index IX_Data on dbo.Data(Type, Date);
Among other queries we often need a snapshot of data per each Type for a latest
Date available:
select
max(Date) Date,
Type
from
dbo.Data
group by
Type
We have
found that the above select does not run well on our data set. In fact dbo.Data
grows with time, while snapshot we need stays more or less of the same size. The
best solution to such query is to precalculate it. One way would be to create an
indexed view, but SQL Server does not support max() aggregate in indexed views.
So, we have decided to add additional bit field dbo.Data.Last indicating that
a row belongs to a last date snapshot, and to create filtered index to access
that snapshot:
create table dbo.Data
(
Date date not null,
Type int not null,
Value nvarchar(50) null,
Last bit not null default 0,
primary key clustered(Date, Type)
);
create unique nonclustered index IX_Data on dbo.Data(Type, Date);
create unique nonclustered index IX_Data_Last on dbo.Data(Type)
include(Date)
where Last = 1;
One way to support Last indicator is to create a trigger that will adjust Last
value:
create trigger dbo.Data_Update on dbo.Data
after insert,delete,update
as
begin
if (trigger_nestlevel(@@procid) < 2)
begin
set nocount on;
with
D as
(
select Date, Type from deleted
union
select Date, Type from inserted
),
U as
(
select
V.Date, V.Type
from
D
inner join
dbo.Data V
on
(V.Last = 1) and
(V.Type = D.Type)
union
select
max(V.Date) Date,
V.Type
from
D
inner join
dbo.Data V
on
V.Type = D.Type
group by
V.Type
),
V as
(
select
rank()
over(partition by
V.Type
order by
V.Date desc) Row,
V.*
from
dbo.Data V
inner join
U
on
(V.Date = U.Date) and
(V.Type = U.Type)
)
update V
set
Last = 1 - cast(Row - 1 as bit);
end;
end;
With Last indicator in action, our original query has been transformed to:
select Date, Type
from dbo.Data where Last = 1
Execution plan shows that a new filtered index
IX_Data_Last is used. Execution speed has increased considerably.
As our actual table contains other bit fields, so Last
indicator did not
increase the table size, as SQL Server packs each 8 bit fields in one byte.
Earlier we have shown
how to build streaming xml reader from business data and have reminded about
ForwardXPathNavigator which helps to create
a streaming xslt transformation. Now we want to show how to stream content
produced with xslt out of WCF service.
To achieve streaming in WCF one needs:
1. To configure service to use streaming. Description on how to do this can be
found in the internet. See web.config of the sample
Streaming.zip for the details.
2. Create a service with a method returning Stream:
[ServiceContract(Namespace = "http://www.nesterovsky-bros.com")]
[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class Service
{
[OperationContract]
[WebGet(RequestFormat = WebMessageFormat.Json)]
public Stream GetPeopleHtml(int count,
int seed)
{
...
}
}
2. Return a Stream from xsl transformation.
Unfortunately (we mentioned it already), XslCompiledTransform generates its
output into XmlWriter (or into output Stream) rather than exposes result as
XmlReader, while WCF gets input stream and passes it to a client.
We could generate xslt output into a file or a memory Stream and then return
that content as input Stream, but this will defeat a goal of streaming, as
client would have started to get data no earlier that the xslt completed its
work. What we need instead is a pipe that form xslt output Stream to an input
Stream returned from WCF.
.NET implements pipe streams, so our task is trivial.
We have defined a utility method that creates an input Stream from a generator
populating an output Stream:
public static Stream GetPipedStream(Action<Stream> generator)
{
var output = new AnonymousPipeServerStream();
var input = new AnonymousPipeClientStream(
output.GetClientHandleAsString());
Task.Factory.StartNew(
() =>
{
using(output)
{
generator(output);
output.WaitForPipeDrain();
}
},
TaskCreationOptions.LongRunning);
return input;
}
We wrapped xsl transformation as such a generator:
[OperationContract]
[WebGet(RequestFormat = WebMessageFormat.Json)]
public Stream GetPeopleHtml(int count, int seed)
{
var context = WebOperationContext.Current;
context.OutgoingResponse.ContentType = "text/html";
context.OutgoingResponse.Headers["Content-Disposition"] =
"attachment;filename=reports.html";
var cache = HttpRuntime.Cache;
var path = HttpContext.Current.Server.MapPath("~/People.xslt");
var transform = cache[path] as XslCompiledTransform;
if (transform == null)
{
transform = new XslCompiledTransform();
transform.Load(path);
cache.Insert(path, transform, new CacheDependency(path));
}
return Extensions.GetPipedStream(
output =>
{
// We have a streamed business data.
var people = Data.CreateRandomData(count, seed, 0, count);
// We want to see it as streamed xml data.
using(var stream =
people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
{
// XPath forward navigator is used as an input source.
transform.Transform(
new ForwardXPathNavigator(reader),
new XsltArgumentList(),
output);
}
});
}
This way we have build a code that streams data directly from business data to a
client in a form of report. A set of utility functions and classes helped us to
overcome .NET's limitations and to build simple code that one can easily
support.
The sources can be found at
Streaming.zip.
In the previous
post about streaming we have dropped at the point where we have XmlReader
in hands, which continously gets data from IEnumerable<Person>
source.
Now we shall remind about ForwardXPathNavigator - a class we have built
back in 2002, which adds streaming transformations to .NET's xslt processor.
While XslCompiledTransform is desperately obsolete, and no upgrade
will possibly follow; still it's among the fastest xslt 1.0 processors. With
ForwardXPathNavigator we add ability to transform input data of arbitrary size to this processor.
We find it interesting that
xslt 3.0 Working Draft defines streaming processing in a way that closely
matches rules for ForwardXPathNavigator:
Streaming achieves two important objectives: it allows large documents to be transformed
without requiring correspondingly large amounts of memory; and it allows the processor
to start producing output before it has finished receiving its input, thus reducing
latency.
The rules for streamability, which are defined in detail in 19.3 Streamability
Analysis, impose two main constraints:
-
The only nodes reachable from the node that is currently being processed are its
attributes and namespaces, its ancestors and their attributes and namespaces, and
its descendants and their attributes and namespaces. The siblings of the node, and
the siblings of its ancestors, are not reachable in the tree, and any attempt to
use their values is a static error. However, constructs (for example, simple forms
of xsl:number, and simple positional patterns) that require knowledge
of the number of preceding elements by name are permitted.
-
When processing a given node in the tree, each descendant node can only be visited
once. Essentially this allows two styles of processing: either visit each of the
children once, and then process that child with the same restrictions applied; or
process all the descendants in a single pass, in which case it is not possible while
processing a descendant to make any further downward selection.
The only significant difference between ForwardXPathNavigator and
xlst 3.0 streaming is in that we reported violations of rules for streamability
at runtime, while xslt 3.0 attempts to perform this analysis at compile time.
Here the C# code for the xslt streamed transformation:
var transform = new XslCompiledTransform();
transform.Load("People.xslt");
// We have a streamed business data.
var people = Data.CreateRandomData(10000, 0, 0, 10000);
// We want to see it as streamed xml data.
using(var stream =
people.ToXmlStream("people", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
using(var output = File.Create("people.html"))
{
// XPath forward navigator is used as an input source.
transform.Transform(
new ForwardXPathNavigator(reader),
new XsltArgumentList(),
output);
}
Notice how XmlReader is wrapped into ForwardXPathNavigator.
To complete the picture we need xslt that follows the streaming rules:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:d="http://www.nesterovsky-bros.com"
exclude-result-prefixes="msxsl d">
<xsl:output method="html" indent="yes"/>
<!-- Root template processed in the streaming mode. -->
<xsl:template match="/d:people">
<html>
<head>
<title>List of persons</title>
<style type="text/css">
.even
{
}
.odd
{
background: #d0d0d0;
}
</style>
</head>
<body>
<table border="1">
<tr>
<th>ID</th>
<th>First name</th>
<th>Last name</th>
<th>City</th>
<th>Title</th>
<th>Age</th>
</tr>
<xsl:for-each select="d:person">
<!--
Get element snapshot.
A
snapshot allows arbitrary access to the element's content.
-->
<xsl:variable name="person">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="position" select="position()"/>
<xsl:apply-templates mode="snapshot" select="msxsl:node-set($person)/d:person">
<xsl:with-param name="position" select="$position"/>
</xsl:apply-templates>
</xsl:for-each>
</table>
</body>
</html>
</xsl:template>
<xsl:template mode="snapshot" match="d:person">
<xsl:param name="position"/>
<tr>
<xsl:attribute name="class">
<xsl:choose>
<xsl:when test="$position mod 2 = 1">
<xsl:text>odd</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>even</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<td>
<xsl:value-of select="d:Id"/>
</td>
<td>
<xsl:value-of select="d:FirstName"/>
</td>
<td>
<xsl:value-of select="d:LastName"/>
</td>
<td>
<xsl:value-of select="d:City"/>
</td>
<td>
<xsl:value-of select="d:Title"/>
</td>
<td>
<xsl:value-of select="d:Age"/>
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
So, we have started with a streamed entity data, proceeded to the streamed
XmlReader and reached to the streamed xslt transformation.
But at the final post about streaming we shall remind a simple way of building
WCF service returning html stream from our xslt transformation.
The sources can be found at
Streaming.zip.
For some reason neither .NET's XmlSerializer nor DataContractSerializer allow
reading data through an XmlReader. These APIs work other way round writing data
into an XmlWriter. To get data through XmlReader one needs to write it to some
destination like a file or memory stream, and then to read it using XmlReader.
This complicates streaming design considerably.
In fact the very same happens with other .NET APIs.
We think the reason of why .NET designers preferred XmlWriter to XmlReader in
those APIs is that XmlReader's implementation is a state machine like, while
XmlWriter's implementation looks like a regular procedure. It's much harder to
manually write and to support a correct state machine logic
than a procedure.
If history would have gone slightly
different way, and if yield return, lambda, and Enumerator API appeared before
XmlReader, and XmlWriter then, we think, both these classes looked differently.
Xml source would have been described with a IEnumerable<XmlEvent> instead of
XmlReader, and XmlWriter must be looked like a function receiving
IEnumerable<XmlEvent>. Implementing XmlReader would have meant a creating a
enumerator. Yield return and Enumerable API would have helped to implement it in
a procedural way.
But in our present we have to deal with the fact that DataContractSerializer
should write the data into XmlWriter, so let's assume we have a project that
uses Entity Framework to access the database, and that you have a data class
Person, and data access method GetPeople():
[DataContract(Name = "person", Namespace = "http://www.nesterovsky-bros.com")]
public class Person
{
[DataMember] public int Id { get; set; }
[DataMember] public string FirstName { get; set; }
[DataMember] public string LastName { get; set; }
[DataMember] public string City { get; set; }
[DataMember] public string Title { get; set; }
[DataMember] public DateTime BirthDate { get; set; }
[DataMember] public int Age { get; set; }
}
public static IEnumerable<Person> GetPeople() { ... }
And your goal is to expose result of GetPeople() as XmlReader.
We achieve result with three simple steps:
- Define
JoinedStream - an input Stream implementation that
reads data from a enumeration of streams (IEnumerable<Stream>).
- Build xml parts in the form of
IEnumerable<Stream>.
- Combine parts into final xml stream.
The code is rather simple, so here we qoute its essential part:
public static class Extensions
{
public static Stream JoinStreams(this IEnumerable<Stream> streams, bool closeStreams = true)
{
return new JoinedStream(streams, closeStreams);
}
public static Stream ToXmlStream<T>(
this IEnumerable<T> items,
string rootName = null,
string rootNamespace = null)
{
return items.ToXmlStreamParts<T>(rootName, rootNamespace).
JoinStreams(false);
}
private static IEnumerable<Stream> ToXmlStreamParts<T>(
this IEnumerable<T> items,
string rootName = null,
string rootNamespace = null)
{
if (rootName == null)
{
rootName = "ArrayOfItems";
}
if (rootNamespace == null)
{
rootNamespace = "";
}
var serializer = new DataContractSerializer(typeof(T));
var stream = new MemoryStream();
var writer = XmlDictionaryWriter.CreateTextWriter(stream);
writer.WriteStartDocument();
writer.WriteStartElement(rootName, rootNamespace);
writer.WriteXmlnsAttribute("s", XmlSchema.Namespace);
writer.WriteXmlnsAttribute("i", XmlSchema.InstanceNamespace);
foreach(var item in items)
{
serializer.WriteObject(writer, item);
writer.WriteString(" ");
writer.Flush();
stream.Position = 0;
yield return stream;
stream.Position = 0;
stream.SetLength(0);
}
writer.WriteEndElement();
writer.WriteEndDocument();
writer.Flush();
stream.Position = 0;
yield return stream;
}
private class JoinedStream: Stream
{
public JoinedStream(IEnumerable<Stream> streams, bool closeStreams = true)
...
}
}
The use is even more simple:
// We have a streamed business data.
var people = GetPeople();
// We want to see it as streamed xml data.
using(var stream = people.ToXmlStream("persons", "http://www.nesterovsky-bros.com"))
using(var reader = XmlReader.Create(stream))
{
...
}
We have packed the sample into the project
Streaming.zip.
In the next post we're going to remind about streaming processing in xslt.
We're pleased to work with Kendo UI. Its design is good, however we find here
and there things we would wish be done better. Here is a list of problems in a
no particular order we
would like to be addressed in the next release:
- RTL is not supported (including correct scroll bar position see
Tunning KendoUI).
- Templates and binding should support a context information along with the data
source. (Why do they use
with statement?)
- attr binding should use jquery.attr() method; there should be prop binding
which is analogous to attr binding.
- There should be custom binding that allows any json object to bind to different
aspects of a widget or an element.
- One should be able to use format/parse functions during binding. (Allow
binding to express as a triple json object?)
parseExact(value, format, culture) method should be rewritten, as it has
nothing in common with parsing data string according to exact format.
- Type inference during binding is poor (
parseOption() method). It works neither for string "1,2", nor json " { x: 0 } ", nor for date.
- Binding is not implemented for many components: splitter, grid.
- Splitter's pane should support size="auto".
- Drid does not support totals in group headers, nor it supports header
selection.
- DataSource does not works after remote error, neither it allows to cancel
request.
- innerHtml is used all over the code, thus one cannot rely on jquery.data().
- Grid does not support customization (localization) of a column filter.
- Grid should support data binding of its content.
- One should be able to destroy any widget.
Some time ago we were taking a part in a project where 95% of all sources are xslt
2.0. It was a great experience for us.
The interesting part is that we used xslt in areas we would never expect
it in
early 2000s. It crunched gigabytes of data in offline, while earlier we
generally sought xslt application in a browser or on a server as an engine to render the data.
Web applications (both .NET and java) are in our focus today, and it became hard
to find application for xslt or xquery.
Indeed, client side now have a very strong APIs: jquery, jqueryui, jsview,
jqgrid, kendoui, and so on. These libraries, and today's browsers cover
developer's needs in building managable applications. In contrast, a native
support of xslt (at least v2) does not exist in browsers.
Server side at present is seen as a set of web services. These services
support both xml and json formats, and implement a business logic only. It
would be a torture to try to write such a frontend in xslt/xquery. A server
logic itself is often dealing with a diversity of data sources like databases,
files (including xml files) and other.
As for a database (we primarily work with SQL Server 2008 R2), we think that all
communication should go through stored procedures, which implement all data
logic. Clearly, this place is not for xslt. However, those who know sql beyond
its basics can confirm that sql is very similar to xquery. More than that SQL
Server (and other databases) integrate xquery to work with xml data, and we do
use it extensively.
Server logic itself uses API like LINQ to manipulate with different data
sources. In fact, we think that one can build a compiler from xquery 3.0 to C#
with LINQ. Other way round compiler would be a whole different story.
The net result is that we see little place for xslt and xquery. Well, after all it's only a personal perspective on the subject.
The similar type of thing has happened to us with C++. As with xslt/xquery we
love C++ very much, and we fond of C++11, but at present we have no place in our
current projects for C++. That's pitty.
P.S. Among other things that play against xslt/xquery is that there is a shortage of people who know these languages, thus who can support such projects.
Let's start from a distance.
We support a busy database for a customer. Customer's requirement (in fact,
state's requirement) is that the database should have audit logs. This
means that all important requests should be logged. These logs help both for the
offline security analysis, and for the database health monitoring.
Before the end of the last year we used SQL Server 2005, and then customer
has upgraded to SQL Server 2008 R2.
As
by design the database is accessed through Stored Procedures only, so the logging was done
using
a small SP that traced input parameters and execution time. The call to that SP
was inserted throughout the code of other SPs.
We expected SQL Server 2008 R2
to simplify the task, and to allow us to switch
the audit on and off on a fine grained level without the need to change a SP in
the production (see
Understanding
SQL Server Audit for details).
Unfortunatelly, we have almost immediately found that
the current audit implementation traces SP calls but does not store parameter values. This way, you can see that
there was a call "execute X @param1, @param2", but you
have no idea what values were passed. Internet search shows that this a known problem (see
SQL Server 2008 Database Audit on INSERT UPDATE and DELETE actual SQL and not
parameter values), which renders SQL Server Audit useless.
But nevertheless, looking at how can we simplify our hand-made audit we have found a
brilliant solution: "Light
weight SQL Server procedure auditing without using SQL Server auditing".
It's so simple, that it's a shame that we did not invent it ourselves!
The approach is to insert or remove tracing code automatically. Indeed, there is
nothing but data in the database, even the text of SP is only a data.
To automate it even more, we have defined a small table with names of procedures
and their log levels, and have defined a procedure "Log.SetLevel @level" to configure
all logging in one go. In addition we have simplified logging procedures and
tables, and started to store parameters in xml columns rather than in a
pipe-concatenated strings.
Now, to the negative SP execution times.
The logging code among other things measures current_timestamp at
the begin and at the end of the execution of SP. This helps us (as developers)
to monitor how database performs on a day to day basis, and to build many useful
statistics.
For example we can see that the duration of about 10% of untrivial selects is
0ms (execution time is under 1ms). This means that SQL Server is good at
data caching. But what is most interesting is that about 0.1% of requests have
negative duration!
You could speculate
on parallel or on out of order execution, but the paradox is
resolved when you look closely on a value of duration. It's always around of -7,200,000ms. No
one will assume that execution has ended two hours before it has started. So,
what does it mean -2 hours? Well, we live in (UTC+02:00) Jerusalem time zone. We think that UTC offset crawls somehow into the result. To prove our
hypothesis we would like to change time zone on sql servers, but customer won't agree on
such an experiment. :-)
This effect probably means that there is some hidden bug in SQL Server 2008 R2 that we cannot
reliably reproduce, but we can see that the datediff(ms,
start_timestamp, end_timestamp) may return negative value when it's known
that start_timestamp is acquired before end_timestamp.
Update: What a shame. During tunning of the original logging procedures we have changed type from datetime to datetime2, and calls from GETUTCDATE() to current_timestamp, except one place (default value in the table definition) where it remained with GETUTCDATE().
So, negative durations meant operation timeout (in our case duration is greater than 30 secs).
After C++11 revision has been approved a new cycle of C++ design has begun:
N3370:
The C++ standards committee is soliciting proposals for additional library components.
Such proposals can range from small (addition of a single signature to an existing
library) to large (something bigger than any current standard library component).
At this stage it's interesting to read papers, as authors try to express ideas
rather than to formulate sentences that should go into spec as it lately was.
These are just several papers that we've found interesting:
|
N3322
|
12-0012
|
A Preliminary Proposal for a Static if
|
Walter E. Brown
|
|
N3329
|
12-0019
|
Proposal: static if declaration
|
H. Sutter, W. Bright, A. Alexandrescu
|
Those proposals argue about compile time "if statement". The feature can replace #if preprocessor
directive, a SFINAE or in some cases template specializations.
A static if declaration can appear wherever a declaration or a statement is
legal.
Authors also propose to add static if clause to a class and a function declarations
to conditionally exclude them from the scope.
Examples:
// Compile time factorial.
template <unsigned n>
struct factorial
{
static if (n <= 1)
{
enum : unsigned { value = 1 };
}
else
{
enum : unsigned { value = factorial<n - 1>::value * n };
}
};
// Declare class provided a condition is true.
class Internals if (sizeof(void*) == sizeof(int));
Paper presents strong rationale why this addition helps to build better
programs, however the questions arise about relations between static if
and concepts, static if clause and an error diagnostics.
|
N3327 |
12-0017 |
A Standard Programmatic Interface for Asynchronous Operations |
N. Gustafsson, A. Laksberg |
|
N3328 |
12-0018 |
Resumable Functions |
Niklas Gustafsson |
That's our favorite.
Authors propose an API and a language extensions to make asynchronous programs
simpler.
In fact, asynchronous function will look very mush as a regular one but with small
additions. It's similar to yield return in C# (a
construct that has been available in C# for many years and is well vetted),
and to async expression in C# 4.5. Compiler will rewrite such a
function into a state machine, thus function can suspend its execution, wait
for the data and to resume when data is available.
Example:
// read data asynchronously from an input and
write it into an output.
int cnt = 0;
do
{
cnt = await streamR.read(512, buf);
if (cnt == 0)
{
break;
}
cnt = await streamW.write(cnt, buf);
}
while(cnt > 0);
It's iteresting to see how authors will address yield return: either with
aditional keyword, or in terms of resumable functions.
|
N3340 |
12-0030 |
Rich Pointers |
D. M. Berris, M. Austern, L. Crowl |
Here authors try to justify rich type-info but mask it under the
name "rich pointers". To make things even more obscure they argue about dynamic code
generation.
If you want a rich type-info then you should talk about it and not about thousand
of other things.
We would better appealed to create a standard API to access post-compile object
model, which could be used to produce different type-infos or other source
derivatives.
This paper is our outsider. :-)
|
N3341 |
12-0031 |
Transactional Language Constructs for C++ |
M. Wong, H. Boehm, J. Gottschlich, T. Shpeisman, et al. |
Here people try to generalize (put you away from) locking, and replace it with
other word "transaction".
Seems it's not viable proposition. It's better to teach on functional style of
programming with its immutable objects.
|
N3347 |
12-0037 |
Modules in C++ (Revision 6) |
Daveed Vandevoorde |
Author argues against C style source composition with #include
directive, and propose alternative called "modules".
We think that many C++ developers would agree that C pre-processor is a legacy that
would never have existed, but for the same reason (for the legacy, and compatibility) it
should stay.
In out opinion the current proposition is just immature, at least it's not
intuitive. Or in other words there should be something to replace the C
pre-processor (and #include as its part), but we don't like this
paper from aestetic perspective.
|
N3365 |
12-0055 |
Filesystem Library Proposal (Revision 2) |
This proposal says no a word about asynchronous nature of file access, while it
should be designed around it.
For a long time we were developing web applications with ASP.NET and JSF. At
present we prefer rich clients and a server with page templates and RESTful web
services.
This transition brings technical questions. Consider this one.
Browsers allow to store session state entirely on the client, so should we
maintain a session on the server?
Since the server is just a set of web services, so we may supply all required
arguments on each call.
At first glance we can assume that no session is required on the server.
However, looking further we see that we should deal with data validation
(security) on the server.
Think about a classic ASP.NET application, where a user can select a value from
a dropdown. Either ASP.NET itself or your program (against a list from a
session) verifies that the value received is valid for the user. That list of
values and might be other parameters constitute a user profile, which we stored
in session. The user profile played important role (often indirectly) in the
validation of input data.
When the server is just a set of web services then we have to validate all
parameters manually. There are two sources that we can rely to: (a)
a session, (b)
a user principal.
The case (a) is very similar to classic ASP.NET application except that with
EnableEventValidation="true" runtime did it for us most of the time.
The case (b) requires reconstruction of the user profile for a user principal
and then we proceed with validation of parameters.
We may cache user profile in session, in which case we reduce (b) to (a); on the
other hand we may cache user profile in
Cache, which is also similar to (a) but which might be lighter than (at least not
heavier than) the solution with the session.
What we see is that the client session does not free us from server session (or
its alternative).
We were dealing with a datasource of (int? id, string
value) pairs in LINQ. The data has originated from a database where id is unique field.
In the program this datasource had to be seen as a dictionary, so we have
written a code like this:
var dictionary =
CreateIDValuePairs().ToDictionary(item => item.ID, item => item.Value);
That was too simple-minded. This code compiles but crashes at runtime when there is an id == null.
Well, help warns about this behaviour, but anyway this does not make pain easier.
In our opinion this restriction is not justified and just complicates the use
of Dictionaty.
A database we support for a client contains multi-billion row tables. Many
users query the data from that database, and it's permanently populated
with a new data.
Every day we load several millions rows of a new data. Such loads can lock tables for a
considerable time, so our loading procedures collect new data into intermediate
tables and insert it into a final destination by chunks, and usually after work
hours.
SQL Server 2008 R2 introduced
READ_COMMITTED_SNAPSHOT database option. This feature trades locks for an
increased tempdb size (to store row versions) and possible performance
degradation during a transaction.
When we have switched the database to that option we did
not notice any considerable performance change. Encouraged, we've decided to
increase size of chunks of data we insert at once.
Earlier we have found that when we insert no more than 1000 rows
at once, users don't notice impact, but for a bigger chunk sizes users start to
complain on performance degradation. This has probably happened due to locks
escalations.
Now, with chunks of 10000 or even 100000 rows we have found that no queries
became slower. But load process became several times faster.
We were ready to pay for increased tempdb and transaction log size to increase
performance, but in our case we didn't approach limits assigned by the DBA.
Another gain is that we can easily load data at any time. This makes data we
store more up to date.
Yesterday, by accident, we've seen an article about some design principles
of V8 JavaScript Engine.
It made clearer what techniques are used in today's script implementations.
In particular V8 engine optimizes property access using "dynamically
created
hidden classes". These are structures to store object's layout, they
are derived when ש new property is created (deleted) on the object. When code
accesses a property, and if a cached object's dynamic hidden class is available
at the code point then access time is comparable to one of native fields.
In our opinion this tactics might lead to a proliferation of such dynamic hidden
classes, which requires a considerable housekeeping, which also slows property
write access, especially when it's written for the first time.
We would like to suggest a slightly different strategy, which exploits the
cache matches, and does not require a dynamic hidden classes.
Consider an implementation data type with following characteristics:
- object is implemented as a hash map of property id to property value: Map<ID, Value>;
- it stores data as an array of pairs and can be accessed directly: Pair<ID,
Value> values[];
- property index can be acquired with a method: int index(ID);
A pseudo code for the property access looks like this:
pair = object.values[cachedIndex];
if (pair.ID == propertyID)
{
value = pair.Value;
}
else
{
// Cache miss.
cachedIndex = object.index(propertyID);
value = objec.values[cachedIndex].Value;
}
This approach brings us back to dictionary like implementation but with
important optimization of array speed access when property index is cached, and
with no dynamic hidden classes.
@michaelhkay Saxon 9.4 is out.
But why author does not state that HE version is still xslt/xpath 2.0, as neither xslt maps, nor function items are supported.
Recently, we have found and reported the bug in the SQL Server 2008 (see SQL Server 2008 with(recompile), and also Microsoft Connect).
Persons, who's responsible for the bug evaluation has closed it, as if "By Design". This strange resolution, in our opinion, says about those persons only.
Well, we shall try once more (see Microsoft Connect). We have posted another trivial demonstartion of the bug, where we show that option(recompile) is not used, which leads to table scan (nothing worse can happen for a huge table).
A bit history: the first release of this solution was about 9.5 years ago...
Today we've run into a strange situation. One of our clients ask us about automatic conversion of data from mainframe (that were defined as COBOL copybooks) into XML or Java/.NET objects. On our suggestion to use eXperanto, which is well known to him, he stated that he wouldn't like to use a tool of a company that is no more exists...
The situation, in our opinion, become more strange when you consider the following:
- eXperanto (the design-time tool and run-time libraries for Java and .NET) were developed, well tested, and delivered by us to production already several years ago.
- the client bought this set (the tool and libraries).
- the set is in production yet already in another big company, and is used time to time by our company in different migration projects.
- the client talks with developers of this tool and run-time libraries, and he knows about this fact.
- the client uses widely open source solutions even without dedicated vendors or support warranties.
It has happened so, that we have never worked with jQuery, however were aware of
it.
In early 2000 we have developed a web application that contained rich javascript
APIs, including UI components. Later, we were actively practicing in ASP.NET, and
later in JSF.
At present, looking at jQuery more closely we regret that we have failed to
start using it earlier.
Separation of business logic and presentation is remarkable when one uses JSON
web services. In fact server part can be seen as a set of web services
representing a business logic and a set of resources: html, styles, scripts,
others. Nor ASP.NET or JSF approach such a consistent separation.
The only trouble, in our opinion, is that jQuery has no standard data binding: a way to bind JSON data
to (and from) html controls. The technique that will probably be standardized is called jQuery Templates or JsViews
.
Unfortunatelly after reading about this
binding API, and
being in love with Xslt and XQuery we just want to cry. We don't know what would
be the best solution for the task, but what we see looks uncomfortable to us.
Incidentally, we have found one new implementation of yield return in java that is
in the development stage. Sources can be found at
https://github.com/peichhorn/lombok-pg/zipball/master. Just to be sure we
have copied those sources at other place
peichhorn-lombok-pg-0.10.0-39-g384fb7b.zip (you may search "yield"
in the archive).
It's broken according to source tracker, but the funny thing is that sources,
however different, still resemble our yield return implementation (Yield.jar,
Yield.3.7.jar
- Indigo, Yield.zip
- sources) very much: variable names, error messages, algorithmic structure.
Those programmers probably have forgotten good manners: to reference a base work,
at least.
Well, we generously forgive them this blunder.
P.S. our implementation, in contrast, works without bugs.
P.P.S. misunderstanding is resolved. See comments.
A couple of weeks ago, we have suggested to introduce a enumerator function into
the XPath (see
[F+O30] A enumerator function):
I would like the WG to consider an addition of a function that turns a sequence
into a enumeration of values.
Consider a function like this:
fn:enumerator($items as item()*) as function() as item()?;
alternatively, signature could be:
fn:enumerator($items as function() as item()*) as function() as item()?;
This function receives a sequence, and returns a function item, which upon N's
call shall return N's element of the original sequence. This way, a sequence of
items is turned into a function providing a enumeration of items of the
sequence.
As an example consider two functions:
a) t:rand($seed as xs:double) as xs:double* - a function producing a random
number sequence;
b) t:work($input as element()) as element() - a function that generates output
from it's input, and that needs random numbers in the course of the execution.
t:work() may contain a code like this:
let $rand := fn:enumerator(t:rand($seed)),
and later it can call $rand() to get a random numbers.
Enumerators will help to compose algorithms where one algorithm communicate
with other independant algorithms, thus making code simpler. The most obvious
class of enumerators are generators: ordered numbers, unique identifiers,
random numbers.
Technically, function returned from fn:enumerator() is nondetermenistic, but its "side effect" is
similar to a "side effect" of a function generate-id() from a newly created
node (see bug #13747, and bug #13494).
The idea is inspired by a generator function, which returns a new value upon each
call.
Such function can be seen as a stateful object. But our goal is to look at
it in a more functional way. So, we look at the algorithm as a function that
produces a sequence of output, which is pure functional; and an enumerator that
allows to iterate over algorithm's output.
This way, we see the function that implements an algorithm and the function that
uses it can be seen as two thread of functional programs that use messaging to
communicate to each other.
Honestly, we doubt that WG will accept it, but it's interesting to watch the
discussion.
More than month has passed since we have reported a problem to the saxon forum (see
Saxon optimizer bug and
Saxon 9.2 generate-id() bug).
The essence of the problem is that we have constructed argumentless function to
return a unique identifiers each time function is called. To achieve the effect
we have created a temporary node and returned its generate-id() value.
Such a function is nondetermenistic, as we cannot state that its result depends
on arguments only. This means that engine's optimizer is not free to reorder
calls to such a function. That's what happens in Saxon 9.2, and Saxon 9.3 where
engine elevates function call out of cycle thus producing invalid results.
Michael Kay, the author of the Saxon engine, argued that this is "a gray area of
the xslt spec":
If the spec were stricter about defining exactly when you can rely on identity-dependent
operations then I would be obliged to follow it, but I think it's probably deliberate
that it currently allows implementations some latitude, effectively signalling to
users that they should avoid depending on this aspect of the behaviour.
He adviced to raise a bug in the w3c bugzilla to resolve the issue. In the end
two related bugs have been raised:
- Bug 13494
- Node
uniqueness returned from XSLT function;
- Bug 13747
- [XPath 3.0] Determinism of expressions returning constructed nodes.
Yesterday, the WG has resolved the issue:
The Working Group agreed that default behavior should continue to require these
nodes to be constructed with unique IDs.
We believe that this is the kind of thing implementations can do with
annotations or declaration options, and it would be best to get implementation
experience with this before standardizing.
This means that the technique we used to generate unique identifiers is correct
and the behaviour is well defined.
The only problem is to wait when Saxon will fix its behaviour accordingly.
We're not big fans of
Entity Framework, as we don't directly expose the database structure to
the client program but rather through stored procedures and functions. So, EF for
us is a tool to expose those stored procedures as .NET wrappers. This limited use
of EF still greatly automates the data access code.
But what we have lately found is that the EF has a problem with char parameters. Namely,
if you import a procedure say MyProc that accepts char(1),
and then will call it through the generated wrapper, the you will see in sql profiler
that char(1) parameter is passed with many trailing spaces as if it
were char(8000). There isn't necessity to prove that this is highly
ineffective.
We can see that the problem happens in VS 2010 designer rather than in the EF runtime,
as SP's parameters are not attributed with length, see model xml (*.edmx):
<Function Name="MyProc" Schema="Data">
...
<Parameter Name="recipientType" Type="char" Mode="In"
/>
...
</Function>
while if we set:
<Parameter Name="recipientType" Type="char" MaxLength="1"
Mode="In" />
the runtime starts working as expected. So the workaround is to fix model file manually.
See also:
Stored Proc and Char parm
1. query.dll vs tquery.dll
We have installed
Windows Search 4 on a Windows 2003 server. The goal was to index huge compressed
xml files (see
Windows Search Notifications). But for some reason it did not want to index
content.
No "select System.ItemUrl from SystemIndex where contains('...')"
has ever returned a row.
We thought that the problem was in our protocol handler, and tried to localize it,
but finally have discovered that Windows Search is not able to find anything within
text files.
Registry comparision has shown that *.txt extension was indexed by the IFilter defined
in the query.dll, while on the other computers, where everything worked, the implementation
was in the tquery.dll.
Both libraries were present on the Windows 2003 server, so we have corrected the
registry and everything has started to work.
As far as we understand query.dll is part of legacy
Indexing Service, and tquery.dll is up to date implementation.
2. Search index size
We have to index a considerable amout of data. But before we can do it we have to
estimate the size of index.
In the past it seems we saw somewhere a statement that search index needs a storage
that's about 10% of original data for its purposes. Unfortunatelly we cannot
find this estimation at present, neither we cannot find any other estimation. This
complicates our planning.
To get empirical estimate we've indexed several thousands *.xml-gz files, which
are gz'ed big xmls. The total size of this files is about 4.5GB. Total uncompressed
size of xmls ~50GB. Xml contained about 10 millions pages of data.
According to 10% criteria we had to arrive to ~5GB search index.
But what we have discovered is that the index has grown to more than 50GB. That's
very disappointing. We cannot afford such expense, as we've commited test on
a tiny part of data, which increases over time.
So, the solution is to find out what's wrong, and how can it be cured, or to
fulltext index only most recent subset of data.
P.S. We have tried to mark folder with search index as compressed, but it did not
work.
P.P.S. We have found the reference to Windows Search 4 index size estimation. It is in
Windows Search Frequently Asked Questions, see answer on "What is average size of a user's index?" question.
We needed to track a stream position during creation of xml file. This is to
allow random access to a huge xml file (the task is related to
WindowsSearch).
This is a simplified form of the xml:
<data>
<item>...</item>
...
<item>...</item>
</data>
The goal was to have stream position of each item element. With this in mind,
we've decided to:
- open a stream, and then xml writer over it;
- write data into xml writer;
- call
Flush() method of the xml writer before measuring stream offset;
That's a code sample:
var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);
writer.WriteStartDocument();
writer.WriteStartElement("data");
for(var i = 0; i < 10; ++i)
{
writer.Flush();
Console.WriteLine("Flush offset: {0}, char: {1}",
stream.Position,
(char)stream.GetBuffer()[stream.Position - 1]);
writer.WriteStartElement("item");
writer.WriteValue("item " + i);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
That's the output:
Flush offset: 46, char: a
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >
Funny, isn't it?
After feeding the start tag <data>, and flushing xml writer we observe that only
"<data" has been written down to the stream. Well,
Flush() have never promissed anything particular about the content
of the stream, so we cannot claim any violation, however we expected to see
whole start tag.
Inspection of the implementation of xml writer reveals laziness during writting
data down the stream. In particular start tag is closed when one starts the
content. This is probably to implement empty tags: <data/>.
To do the trick we had to issue empty content, moreover, to call a particular
method with particular parameters of the xml writer. So the code after the fix
looks like this:
var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);
writer.WriteStartDocument();
writer.WriteStartElement("data");
char[] empty = { ' ' };
for(var i = 0; i < 10; ++i)
{
writer.WriteChars(empty, 0, 0);
writer.Flush();
Console.WriteLine("Flush offset: {0}, char: {1}",
stream.Position,
(char)stream.GetBuffer()[stream.Position - 1]);
writer.WriteStartElement("item");
writer.WriteValue("item " + i);
writer.WriteEndElement();
}
writer.WriteEndElement();
writer.WriteEndDocument();
And output is:
Flush offset: 47, char: >
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >
While this code works, we feel uneasy with it.
What's the better way to solve the task?
Update: further analysis shows that it's
only possible behaviour, as after the call to write srart element, you either
can write attributes, content or end of element, so writer may write either
space, '>' or '/>'. The only
question is why it takes WriteChars(empty, 0, 0) into account and WriteValue("")
it doesn't.
As you probably know we have implemented our custom Protocol Handler for the Windows
Search.
It's called .xml-gz, and has a goal to index compressed xml files and to have
search results with a subtree precision. So, for xml:
<data>
<item>...</item>
<item>...</item>
...
</data>
search finds results within item and returns xml's url and stream
offset of the item. Using ZLIB API we can compress data with stream bookmarks, so fast random
access to the data is possible.
The only problem we have is about notification of changes (create, delete, update)
of such files.
Spec describes several techniques (nothing has worked for us):
1. Call catalogManager.ReindexMatchingURLs()
- it just returns without any impact.
2.Call changeSink.OnItemsChanged()
- returns error.
3. Implement
.xml-gz IFilter and call IGatherNotifyInline (see "
have your .zip urls indexed when they are created or modified") -
that's a mistery, as:
4. Implement root url in form .xml-gz:/// and perform Windows Search:
SELECT
System.ItemUrl, System.DateModified
FROM
SystemIndex WHERE System.FileExtension='.xml-gz'
to find all .xml-gz sources. This is not reliable, as your protocol handler can
be (and is) called before file is indexed.
So, the only reliable way to index your data is to (re-)add indexing rule for
the protocol handler, which in most cases reindexes everything.
The only bearable solution we found is to define indexing rule in the form:
.xml-gz://file:d:/data/... and to use
IShellFolder(2)
interfaces to discover sub items and their modification times. This technique allows
minimal data scan when you're (re-)add indexing rule.
Being unexperienced with Windows Search we tried to build queries to find data in the huge storage. We needed to find a document that matches some name pattern and contains some text.
Our naive query was like this:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemName like '...%' and contains('...')
In most cases this query returns nothing and runs very long. It's interesting to note that it may start returning data if "top" clause is missing or uses a bigger number, but in this cases query is slower even more.
Next try was like this:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemName >= '...' and System.ItemName < '...' and contains('...')
This query is also slow, but at least it returns some results.
At some point we have started to question the utility of Windows Search if it's so slow, but then we have found that there is a property System.ItemNameDisplay, which in our case coincides with the value of property System.ItemName, so we have tried the query:
select top 1000 System.ItemUrl from SystemIndex where scope = '...' and System.ItemNameDisplay like '...%' and contains('...')
This query worked fast, and produced good results. This hints that search engine has index on System.ItemNameDisplay in contrast to System.ItemName property.
We've looked at property definitions:
System.ItemNameDisplay
The display name in "most complete" form. It is the unique representation of the item name most appropriate for end users.
propertyDescription name = System.ItemNameDisplay shellPKey = PKEY_ItemNameDisplay formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC propID = 10 searchInfo inInvertedIndex = true isColumn = true isColumnSparse = false columnIndexType = OnDisk maxSize = 128
System.ItemName
The base name of the System.ItemNameDisplay property.
propertyDescription name = System.ItemName shellPKey = PKEY_ItemName formatID = 6B8DA074-3B5C-43BC-886F-0A2CDCE00B6F propID = 100 searchInfo inInvertedIndex = false isColumn = true isColumnSparse = false columnIndexType = OnDisk maxSize = 128
Indeed, one property is indexed, while the other is not.
As with other databases, query is powerful when engine uses indices rather than performs data scan. This is also correct for Windows Search.
The differences in results that variations of query produce also manifests that Windows Search nevertheless is very different from relational database.
We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.
You can think of some virtual folder, so a Protocol Handler allows to enumerate it's files, file properties, and contents.
The goal of our Protocol Handler is to represent some data structure as a set of xml files. We expected that if we found a data within a folder with these files, then a search within Protocol Handler's scope would bring the same (or almost the same) results.
Reality is different.
For some reason .xml IFilter (a component to extract text data to index) works differently with file system and with our storage. We cannot state that it does not work, but for some reason many words that Windows Search finds within a file are never found within Protocol Handler scope.
We have observed that if, for purpose of indexing, we represent content xml items as .txt files, then search works as expected. So, our workaround was to present only xml's text data for the indexing, and to use .txt IFilter (this in fact roughly what .xml IFilter does by itself).
Is there a conclusion?
Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.
Let's put it blatantly: Windows Search 4 has design and implementation problems.
You discover this immediatelly when you start implementing indexing of custom file format.
If you want to index simple file format then you need to imlement you IFilter interface. But if it has happened so that you want to index compound data then you should invent you own protocols.
If you will fugure out how to implement your protocol to index that compound data, then you will most probably stuck on the problem on how to notify indexer about the changes.
The problem is that Windows Search 4 has API to reindex urls, which simply does not work, or to notify indexer about changes, which throws an error (returns error HRESULT) for custom protocols. At least, we were not able to make it run.
Do you know that the best JSF/Facelets visual editor, in our opinion, is ... Microsoft Visual Studio 2008? Another rather good JSF editor is presented in IBM RAD 7.xx. The most popular, open source Java IDE Eclipse contains an ugly implementation of such useful thing.
Already for a couple of days we're trying to create a UserControl containing a
TabContainer. To achieve the goal we have
created a page with a ToolkitScriptManager and a user control itself.
Page:
<form runat="server">
<ajax:ToolkitScriptManager ID="ToolkitScriptManager1" runat="server"/>
<uc1:WebUserControl1 ID="WebUserControl11" runat="server" />
</form>
User control:
<%@ Control Language="C#" %>
<%@ Register
Assembly="AjaxControlToolkit"
Namespace="AjaxControlToolkit"
TagPrefix="ajax" %>
<ajax:TabContainer ID="Tab" runat="server" Width="100%">
<ajax:TabPanel runat="server" HeaderText="Tab1" ID="Tab1">
<ContentTemplate>Panel 1</ContentTemplate>
</ajax:TabPanel>
</ajax:TabContainer>
What could be more simple?
But no, there is a problem. At run-time
control works perfectly, but at the
designer it shows an error instead of a normal design view:
Error Rendering Control - TabContainer1
An unhandled exception has occurred.
Could not find any resources appropriate for the specified culture or the neutral culture. Make sure "AjaxControlToolkit.Properties.Resources.NET4.resources" was correctly embedded or linked into assembly "AjaxControlToolkit" at compile time, or that all the satellite assemblies required are loadable and fully signed.
That's a stupid error, which says nothing about the real problem reason. We had
to attach a debugger to a Visual Studio just to realize what the problem is.
So, the error occurs at the following code of AjaxControlToolkit.ScriptControlBase:
private void EnsureScriptManager()
{
if (this._scriptManager == null)
{
this._scriptManager = ScriptManager.GetCurrent(this.Page);
if (this._scriptManager == null)
{
throw new HttpException(Resources.E_NoScriptManager);
}
}
}
Originally, the problem is due to the fact that ScriptManager is not found, and code
wants to report an HttpException, but fun is that we recieve a different exception, which is releted to a missing resouce text for a message Resources.E_NoScriptManager.
It turns out that E_NoScriptManager text is found neither in
primary no in resource assemblies.
As for original problem, it's hard to say about reason of why ScriptManager is
not available at design time. We, however, observed that a ScriptManager
registers itself for a ScriptManager.GetCurrent() method at
run-time only:
protected internal override void OnInit(EventArgs e)
{
...
if (!base.DesignMode)
{
...
iPage.Items[typeof(ScriptManager)] = this;
...
}
}
So, it's not clear what they (toolkit's developers) expected to get at design
time.
These observations leave uneasiness regarding the quality of the library.
Earlier, we have described an approach to call Windows Search from SQL Server 2008. But it has turned out that our problem is more complicated...
All has started from the initial task:
- to allow free text search in a store of huge xml files;
- files should be compressed, so these are *.xml.gz;
- search results should be addressable to a fragment within xml.
Later we shall describe how we have solved this task, and now it's enough to say that we have implemented a Protocol Handler for Windows Search named '.xml-gz:'. This way original file stored say at 'file:///c:/store/data.xml-gz' is seen as a container by the Windows Search:
- .xml-gz:///file:c:/store/data.xml-gz/id1.xml
- .xml-gz:///file:c:/store/data.xml-gz/id2.xml
- ...
This way search xml should be like this:
select System.ItemUrl from SystemIndex where scope='.xml-gz:' and contains(...)
Everything has worked during test: we have succeeded to issue Windows Search selects from SQL Server and join results with other sql queries.
But later on when we considered a runtime environment we have seen that our design won't work. The reason is simple. Windows Search will work on a computer different from those where SQL Servers run. So, the search query should look like this:
select System.ItemUrl from Computer.SystemIndex where scope='.xml-gz:' and contains(...)
Here we have realized the limitation of current (Windows Search 4) implementation: remote search works for shared folders only, thus query may only look like:
select System.ItemUrl from Computer.SystemIndex where scope='file://Computer/share/' and contains(...)
Notice that search restricts the scope to a file protocol, this way remoter search will never return our results. The only way to search in our scope is to perform a local search.
We have considered following approaches to resolve the issue.
The simplest one would be to access Search protocol on remote computer using a connection string: "Provider=Search.CollatorDSO;Data Source=Computer" and use local queries. This does not work, as provider simply disregards Data Source parameter.
The other try was to use MS Remote OLEDB provider. We tried hard to configure it but it always returns obscure error, and more than that it's deprecated (Microsoft claims to remove it in future).
So, we decided to forward request manually:
- SQL Server calls a web service (through a CLR function);
- Web service queries Windows Search locally.
Here we considered WCF Data Services and a custom web service.
The advantage of WCF Data Services is that it's a technology that has ambitions of a standard but it's rather complex task to create implementation that will talk with Windows Search SQL dialect, so we have decided to build a primitive http handler to get query parameter. That's trivial and also has a virtue of simple implementation and high streamability.
So, that's our http handler (WindowsSearch.ashx):
<%@ WebHandler Language="C#" Class="WindowsSearch" %>
using System; using System.Web; using System.Xml; using System.Text; using System.Data.OleDb;
/// <summary> /// A Windows Search request handler. /// </summary> public class WindowsSearch: IHttpHandler { /// <summary> /// Handles the request. /// </summary> /// <param name="context">A request context.</param> public void ProcessRequest(HttpContext context) { var request = context.Request; var query = request.Params["query"]; var response = context.Response;
response.ContentType = "text/xml"; response.ContentEncoding = Encoding.UTF8;
var writer = XmlWriter.Create(response.Output);
writer.WriteStartDocument(); writer.WriteStartElement("resultset");
if (!string.IsNullOrEmpty(query)) { using(var connection = new OleDbConnection(provider)) using(var command = new OleDbCommand(query, connection)) { connection.Open();
using(var reader = command.ExecuteReader()) { string[] names = null;
while(reader.Read()) { if (names == null) { names = new string[reader.FieldCount];
for (int i = 0; i < names.Length; ++i) { names[i] = XmlConvert.EncodeLocalName(reader.GetName(i)); } }
writer.WriteStartElement("row");
for(int i = 0; i < names.Length; ++i) { writer.WriteElementString( names[i], Convert.ToString(reader[i])); }
writer.WriteEndElement(); } } } }
writer.WriteEndElement(); writer.WriteEndDocument();
writer.Flush(); }
/// <summary> /// Indicates that a handler is reusable. /// </summary> public bool IsReusable { get { return true; } }
/// <summary> /// A connection string. /// </summary> private const string provider = "Provider=Search.CollatorDSO;" + "Extended Properties='Application=Windows';" + "OLE DB Services=-4"; }
And a SQL CLR function looks like this:
using System; using System.Collections; using System.Collections.Generic; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using Microsoft.SqlServer.Server; using System.Net; using System.IO; using System.Xml;
/// <summary> /// A user defined function. /// </summary> public class UserDefinedFunctions { /// <summary> /// A Windows Search returning result as xml strings. /// </summary> /// <param name="url">A search url.</param> /// <param name="userName">A user name for a web request.</param> /// <param name="password">A password for a web request.</param> /// <param name="query">A Windows Search SQL.</param> /// <returns>A result rows.</returns> [SqlFunction( IsDeterministic = false, Name = "WindowsSearch", FillRowMethodName = "FillWindowsSearch", TableDefinition = "value nvarchar(max)")] public static IEnumerable Search( string url, string userName, string password, string query) { return SearchEnumerator(url, userName, password, query); }
/// <summary> /// A filler of WindowsSearch function. /// </summary> /// <param name="value">A value returned from the enumerator.</param> /// <param name="row">An output value.</param> public static void FillWindowsSearch(object value, out string row) { row = (string)value; }
/// <summary> /// Gets a search row enumerator. /// </summary> /// <param name="url">A search url.</param> /// <param name="userName">A user name for a web request.</param> /// <param name="password">A password for a web request.</param> /// <param name="query">A Windows Search SQL.</param> /// <returns>A result rows.</returns> private static IEnumerable<string> SearchEnumerator( string url, string userName, string password, string query) { if (string.IsNullOrEmpty(url)) { throw new ArgumentException("url"); }
if (string.IsNullOrEmpty(query)) { throw new ArgumentException("query"); }
var requestUrl = url + "?query=" + Uri.EscapeDataString(query);
var request = WebRequest.Create(requestUrl);
request.Credentials = string.IsNullOrEmpty(userName) ? CredentialCache.DefaultCredentials : new NetworkCredential(userName, password);
using(var response = request.GetResponse()) using(var stream = response.GetResponseStream()) using(var reader = XmlReader.Create(stream)) { bool read = true;
while(!read || reader.Read()) { if ((reader.Depth == 1) && reader.IsStartElement()) { // Note that ReadInnerXml() advances the reader similar to Read(). yield return reader.ReadInnerXml();
read = false; } else { read = true; } } } } }
And, finally, when you call this service from SQL Server you write query like this:
with search as ( select cast(value as xml) value from dbo.WindowsSearch ( N'http://machine/WindowsSearchService/WindowsSearch.ashx', null, null, N' select "System.ItemUrl" from SystemIndex where scope=''.xml-gz:'' and contains(''...'')' ) ) select value.value('/System.ItemUrl[1]', 'nvarchar(max)') from search
Design is not trivial but it works somehow.
After dealing with all these problems some questions remain unanswered:
- Why SQL Server does not allow to query Windows Search directly?
- Why Windows Search OLEDB provider does not support "Data Source" parameter?
- Why Windows Search does not support custom protocols during remote search?
- Why SQL Server does not support web request/web services natively?
We were trying to query Windows Search from an SQL Server 2008.
Documentation states that Windows Search is exposed as OLE DB datasource. This meant that we could just query result like this:
SELECT * FROM OPENROWSET( 'Search.CollatorDSO.1', 'Application=Windows', 'SELECT "System.ItemName", "System.FileName" FROM SystemIndex');
But no, such select never works. Instead it returns obscure error messages:
OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" returned message "Command was not prepared.". Msg 7399, Level 16, State 1, Line 1 The OLE DB provider "Search.CollatorDSO.1" for linked server "(null)" reported an error. Command was not prepared. Msg 7350, Level 16, State 2, Line 1 Cannot get the column information from OLE DB provider "Search.CollatorDSO.1" for linked server "(null)".
Microsoft is silent about reasons of such behaviour. People came to a conclusion that the problem is in the SQL Server, as one can query search results through OleDbConnection without problems.
This is very unfortunate, as it bans many use cases.
As a workaround we have defined a CLR function wrapping Windows Search call and returning rows as xml fragments. So now the query looks like this:
select value.value('System.ItemName[1]', 'nvarchar(max)') ItemName, value.value('System.FileName[1]', 'nvarchar(max)') FileName from dbo.WindowsSearch('SELECT "System.ItemName", "System.FileName" FROM SystemIndex')
Notice how we decompose xml fragment back to fields with the value() function.
The C# function looks like this:
using System; using System.Collections; using System.IO; using System.Xml; using System.Data; using System.Data.SqlClient; using System.Data.SqlTypes; using System.Data.OleDb;
using Microsoft.SqlServer.Server;
public class UserDefinedFunctions { [SqlFunction( FillRowMethodName = "FillSearch", TableDefinition="value xml")] public static IEnumerator WindowsSearch(SqlString query) { const string provider = "Provider=Search.CollatorDSO;" + "Extended Properties='Application=Windows';" + "OLE DB Services=-4";
var settings = new XmlWriterSettings { Indent = false, CloseOutput = false, ConformanceLevel = ConformanceLevel.Fragment, OmitXmlDeclaration = true };
string[] names = null;
using(var connection = new OleDbConnection(provider)) using(var command = new OleDbCommand(query.Value, connection)) { connection.Open();
using(var reader = command.ExecuteReader()) { while(reader.Read()) { if (names == null) { names = new string[reader.FieldCount];
for (int i = 0; i < names.Length; ++i) { names[i] = XmlConvert.EncodeLocalName(reader.GetName(i)); } }
var stream = new MemoryStream(); var writer = XmlWriter.Create(stream, settings);
for(int i = 0; i < names.Length; ++i) { writer.WriteElementString(names[i], Convert.ToString(reader[i])); }
writer.Close();
yield return new SqlXml(stream); } } } }
public static void FillSearch(object value, out SqlXml row) { row = (SqlXml)value; } }
Notes:
- Notice the use of "
OLE DB Services=-4" in provider string to avoid transaction enlistment (required in SQL Server 2008).
- Permission level of the project that defines this extension function should be set to unsafe (see Project Properties/Database in Visual Studio) otherwise it does not allow the use OLE DB.
- SQL Server should be configured to allow CLR functions, see Server/Facets/Surface Area Configuration/ClrIntegrationEnabled in Microsoft SQL Server Management Studio
- Assembly should either be signed or a database should be marked as trustworthy, see Database/Facets/Trustworthy in Microsoft SQL Server Management Studio.
Last few days we were testing Java web-applications that expose web-services. During these tests we've found few interesting features.
The first feature allows to retrieve info about all endpoints supported by the web-application on GET request. The feature works at least for Metro that implements JAX-WS API v2.x. In order to get such info, a client sends any endpoint's URL to the server. The result is an HTML page with a table. Each row of such table contains an endpoint's data for each supported web-service method. This feature may be used as a web-services discovery mechanism.
The second feature is bad rather than good. JAX-WS API supposes that a developer annotates classes and methods that he/she wants to expose as web-services. Then, an implementation generates additional layer-bridge between developer's code and API that does all routine work behind the scene. May be that was a good idea, but Metro's implementation is imperfect. Metro dynamically generates such classes at run-time when a web-application starts. Moreover, Metro does such generation for all classes at once. So, in our case, when the generated web-based application contains dozens or even hundreds of web-services, the application's startup takes a lot of time.
Probably, Metro developers didn't want to deal with implementation of lazy algorithms, when a web-service is generated and cached on demand. We hope this issue will be solved in next releases.
A while ago we have created
a simple cache for Java application. It was modelled like a Map<K, V>: it cached
values for keys.
Use cases were:
Cache<String, Object> cache = new Cache<String, Object>();
...
instance = cache.get("key");
cache.put("key", instance);
But now we thought of different implementation like a WeakReference<V>
and with map access as additional utility methods.
Consider an examples:
1. Free standing CachedReference<V> instance.
CachedReference<Data> ref = new CachedReference<Data>(1000, true);
...
ref.set(data);
...
data = ref.get();
2. Map of CachedReference<V> instances.
ConcurrentHashMap<String, CachedReference<Data>> cache =
new
ConcurrentHashMap<String, CachedReference<Data>>();
CachedReference.put(cache, "key", data, 1000, true);
... data = CachedReference.get(cache, "key");
The first case is faster than original Cache<K, V> as it does not use any hash
map at all. The later case provides the same performance as Cache<K, V> but
gives a better control over the storage. Incidentally, CachedReference<V> is more compact than Cache<K, V>.
The new implementation is
CachedReference.java, the old one
Cache.java.
A method pattern we have suggested to use along with @Yield annotation brought
funny questions like: "why should I mark my method with @Yield annotation at
all?"
Well, in many cases you may live with ArrayList populated with data, and then to
perform iteration. But in some cases this approach is not practical either due
to amount of data or due to the time required to get first item.
In later
case you usually want to build an iterator that calculates items on demand. The @Yield annotation is designed as a marker of such methods. They are refactored
into state machines at compilation time, where each addition to a result list is
transformed into a new item yielded by the iterator.
So, if you have decided to use @Yield annotation then at some point you will ask yourself what
happens with resources acquired during iteration. Will resources be released if
iteration is interrupted in the middle due to exception or a break statement?
To address the problem yield iterator implements Closeable interface.
This way when you call close() before iteration reached the end, the state machine
works as if break statement of the method body is injected after the yield
point. Thus all finally blocks of the original method are executed and resources
are released.
Consider an example of data iterator:
@Yield
public Iterable<Data> getData(final Connection connection)
throws Exception
{
ArrayList<Data> result = new ArrayList<Data>();
PreparedStatement statement =
connection.prepareStatement("select key, value from table");
try
{
ResultSet resultSet = statement.executeQuery();
try
{
while(resultSet.next())
{
Data data = new Data();
data.key = resultSet.getInt(1);
data.value = resultSet.getString(2);
result.add(data); // yield point
}
}
finally
{
resultSet.close();
}
}
finally
{
statement.close();
}
return result;
}
private static void close(Object value)
throws IOException
{
if (value instanceof Closeable)
{
Closeable closeable = (Closeable)value;
closeable.close();
}
}
public void daoAction(Connection connection)
throws Exception
{
Iterable<Data> items = getData(connection);
try
{
for(Data data: items)
{
// do something that potentially throws exception.
}
}
finally
{
close(items);
}
}
getData() iterates over sql data. During the lifecycle it creates and releases
PreparedStatement and ResultSet.
daoAction() iterates over results provided by getData() and performs some
actions that potentially throw an exception. The goal of close() is to release
opened sql resources in case of such an exception.
Here you can inspect how state machine is implemented for such a method:
@Yield()
public static Iterable<Data> getData(final Connection connection)
throws Exception
{
assert (java.util.ArrayList<Data>)(ArrayList<Data>)null == null;
class $state implements java.lang.Iterable<Data>, java.util.Iterator<Data>, java.io.Closeable
{
public java.util.Iterator<Data> iterator() {
if ($state$id == 0) {
$state$id = 1;
return this;
} else return new $state();
}
public boolean hasNext() {
if (!$state$nextDefined) {
$state$hasNext = $state$next();
$state$nextDefined = true;
}
return $state$hasNext;
}
public Data next() {
if (!hasNext()) throw new java.util.NoSuchElementException();
$state$nextDefined = false;
return $state$next;
}
public void remove() {
throw new java.lang.UnsupportedOperationException();
}
public void close() {
do switch ($state$id) {
case 3:
$state$id2 = 8;
$state$id = 5;
continue;
default:
$state$id = 8;
continue;
} while ($state$next());
}
private boolean $state$next() {
java.lang.Throwable $state$exception;
while (true) {
try {
switch ($state$id) {
case 0:
$state$id = 1;
case 1:
statement = connection.prepareStatement("select key, value from table");
$state$exception1 = null;
$state$id1 = 8;
$state$id = 2;
case 2:
resultSet = statement.executeQuery();
$state$exception2 = null;
$state$id2 = 6;
$state$id = 3;
case 3:
if (!resultSet.next()) {
$state$id = 4;
continue;
}
data = new Data();
data.key = resultSet.getInt(1);
data.value = resultSet.getString(2);
$state$next = data;
$state$id = 3;
return true;
case 4:
$state$id = 5;
case 5:
{
resultSet.close();
}
if ($state$exception2 != null) {
$state$exception = $state$exception2;
break;
}
if ($state$id2 > 7) {
$state$id1 = $state$id2;
$state$id = 7;
} else $state$id = $state$id2;
continue;
case 6:
$state$id = 7;
case 7:
{
statement.close();
}
if ($state$exception1 != null) {
$state$exception = $state$exception1;
break;
}
$state$id = $state$id1;
continue;
case 8:
default:
return false;
}
} catch (java.lang.Throwable e) {
$state$exception = e;
}
switch ($state$id) {
case 3:
case 4:
$state$exception2 = $state$exception;
$state$id = 5;
continue;
case 2:
case 5:
case 6:
$state$exception1 = $state$exception;
$state$id = 7;
continue;
default:
$state$id = 8;
java.util.ConcurrentModificationException ce = new java.util.ConcurrentModificationException();
ce.initCause($state$exception);
throw ce;
}
}
}
private PreparedStatement statement;
private ResultSet resultSet;
private Data data;
private int $state$id;
private boolean $state$hasNext;
private boolean $state$nextDefined;
private Data $state$next;
private java.lang.Throwable $state$exception1;
private int $state$id1;
private java.lang.Throwable $state$exception2;
private int $state$id2;
}
return new $state();
}
Now, you can estimate for what it worth to write an algorithm as a sound state machine
comparing to the conventional implementation.
Yield annotation processor can be downloaded from
Yield.zip
or Yield.jar
See also
Yield return feature in java.
We're happy to announce that we have implemented @Yield annotation
both in javac and in eclipse compilers.
This way you get built-in IDE support for the feature!
To download yield annotation processor please use the following link:
Yield.zip
It contains both yield annotation processor, and a test project.
If you do not want to compile the sources, you can download
Yield.jar
We would like to reiterate on how @Yield annotation works:
- A developer defines a method that returns either
Iterator<T> or
Iterable<T> instance and marks it with @Yield
annotation.
- A developer implements iteration logic following the pattern:
- declare a variable to accumulate results:
ArrayList<T> items = new ArrayList<T>();
- use the following statement to add item to result:
items.add(...);
- use
return items;
or
return items.iterator();
to return result;
- mark method's params, if any, as final.
- A devoloper ensures that yield annotation processor is available during
compilation (see details below).
YieldProcessor rewrites method into a state machine at
compilation time.
The following is an example of such a method:
@Yield
public static Iterable<Integer> generate(final int from, final int to)
{
ArrayList<Integer> items = new ArrayList<Integer>();
for(int i = from; i < to; ++i)
{
items.add(i);
}
return items;
}
The use is like this:
for(int value: generate(7, 20))
{
System.out.println("generator: " + value);
}
Notice that method's implementation still will be correct in absence of
YieldProcessor.
Other important feature is that the state machine returned after the yield
processor is closeable.
This means that if you're breaking the iteration before the end is reached you
can release resources acquired during the iteration.
Consider the example where break exits iteration:
@Yield
public static Iterable<String> resourceIteration()
{
ArrayList<String> items = new ArrayList<String>();
acquire();
try
{
for(int i = 0; i < 100; ++i)
{
items.add(String.valueOf(i));
}
}
finally
{
release();
}
return items;
}
and the use
int i = 0;
Iterable<String> iterator = resourceIteration();
try
{
for(String item: iterator)
{
System.out.println("item " + i + ":" + item);
if (i++ > 30)
{
break;
}
}
}
finally
{
close(iterator);
}
...
private static <T> void close(T value)
throws IOException
{
if (value instanceof Closeable)
{
Closeable closeable = (Closeable)value;
closeable.close();
}
}
Close will execute all required finally blocks. This way resources will be
released.
To configure yield processor a developer needs to refer Yield.jar in build path,
as it contains @Yield annotation. For javac it's enough, as
compiler will find annotation processor automatically.
Eclipse users need to open project properties and:
- go to the "Java Compiler"/"Annotation Processing"
- mark "Enable project specific settings"
- select "Java Compiler"/"Annotation Processing"/"Factory Path"
- mark "Enable project specific settings"
- add Yield.jar to the list of "plug-ins and JARs that contain annotation
processors".
At the end we want to point that @Yield annotation is a syntactic
suggar, but it's important the way the foreach statement is important, as it
helps to write concise and an error free code.
See also
Yield feature in java implemented!
Yield feature in java
We could not stand the temptation to implement the @Yield annotation that
we described
earlier.
Idea is rather clear but people are saying that it's not an easy task to update
the sources.
They were right!
Implementation has its price, as we were forced to access JDK's classes of javac
compiler. As result, at present, we don't support other compilers such as
EclipseCompiler.
We shall look later what can be done in this area.
At present, annotation processor works perfectly when you run javac either from
the command line, from ant, or from other build tool.
Here is an example of how method is refactored:
@Yield
public static Iterable<Long> fibonachi()
{
ArrayList<Long> items = new ArrayList<Long>();
long Ti = 0;
long Ti1 = 1;
while(true)
{
items.add(Ti);
long value = Ti + Ti1;
Ti = Ti1;
Ti1 = value;
}
}
And that's how we transform it:
@Yield()
public static Iterable<Long> fibonachi() {
assert (java.util.ArrayList<Long>)(ArrayList<Long>)null == null : null;
class $state$ implements java.lang.Iterable<Long>, java.util.Iterator<Long>, java.io.Closeable {
public java.util.Iterator<Long> iterator() {
if ($state$id == 0) {
$state$id = 1;
return this;
} else return new $state$();
}
public boolean hasNext() {
if (!$state$nextDefined) {
$state$hasNext = $state$next();
$state$nextDefined = true;
}
return $state$hasNext;
}
public Long next() {
if (!hasNext()) throw new java.util.NoSuchElementException();
$state$nextDefined = false;
return $state$next;
}
public void remove() {
throw new java.lang.UnsupportedOperationException();
}
public void close() {
$state$id = 5;
}
private boolean $state$next() {
while (true) switch ($state$id) {
case 0:
$state$id = 1;
case 1:
Ti = 0;
Ti1 = 1;
case 2:
if (!true) {
$state$id = 4;
break;
}
$state$next = Ti;
$state$id = 3;
return true;
case 3:
value = Ti + Ti1;
Ti = Ti1;
Ti1 = value;
$state$id = 2;
break;
case 4:
case 5:
default:
$state$id = 5;
return false;
}
}
private long Ti;
private long Ti1;
private long value;
private int $state$id;
private boolean $state$hasNext;
private boolean $state$nextDefined;
private Long $state$next;
}
return new $state$();
}
Formatting is automatic, sorry, but anyway it's for diagnostics only. You
will never see this code.
It's iteresting to say that this implementation is very precisely mimics
xslt state machine implementation we have done back in 2008.
You can
download YieldProcessor here. We hope that someone will find our solution
very interesting.
Several times we have already wished to see
yield feature in java and all the time came to the same implementation:
infomancers-collections.
And every time with dissatisfaction turned away, and continued with regular
iterators.
Why? Well, in spite of the fact it's the best implementation of the feature we have
seen, it's still too heavy, as it's playing with java byte code at run-time.
We never grasped the idea why it's done this way, while there is
post-compile
time annotation processing in java.
If we would implemented the yeild feature in java we would created a @Yield
annotation and would demanded to implement some well defined code pattern like
this:
@Yield
Iteratable<String> iterator()
{
// This is part of pattern.
ArrayList<String> list = new ArrayList<String>();
for(int i = 0; i < 10; ++i)
{
// list.add() plays the role of yield return.
list.add(String.valueOf(i));
}
// This is part of pattern.
return list;
}
or
@Yield
Iterator<String> iterator()
{
// This is part of pattern.
ArrayList<String> list = new ArrayList<String>();
for(int i = 0; i < 10; ++i)
{
// list.add() plays the role of yield return.
list.add(String.valueOf(i));
}
// This is part of pattern.
return list.iterator();
}
Note that the code will work correctly even, if by mischance, post-compile-time
processing will not take place.
At post comile time we would do all required refactoring to turn these
implementations into a state machines thus runtime would not contain any third
party components.
It's iteresting to recall that we have also implemented similar refactoring in
pure xslt.
See What you can do with jxom.
Update: implementation can be found at Yield.zip
We have a class Beans used to serialize a list of generic objects into an xml.
This is done like this:
public class Call
{
public Beans input;
public Beans output;
...
}
@XmlJavaTypeAdapter(value = BeanAdapter.class)
public class Beans
{
public List<Object> bean;
}
Thanks to @XmlJavaTypeAdapter, we're able to write xml in
whatever form we want.
When we're serializing a Call instance:
Call call = ...
Beans beans = ...;
call.setInput(beans);
JAXBContext context = ...;
Marshaller marshaler = context.createMarshaller();
ObjectFactory factory = ...;
marshaler.marshal(factory.createCall(call),
result);
things work as expected, meaning that BeanAdapter is used during xml
serialization. But if it's happened that you want to serialize a Beans instance
itself, you start getting problems with the serialization of unknown objects. That's because JAXB does not use BeanAdapter.
We have found a similar case "How to assign an adapter
to the root element?", unfortunately with no satisfactory explanation.
That is strange.
michaelhkay: Saxon 9.3 has been out for 8 days: only two bugs so far, one found by me. I think that's a record.
Not necessary. We, for example, who use Saxon HE, have found nothing new in Saxon 9.3, while expected to see xslt 3.0. Disappointed. No actual reason to migrate.
P.S. We were among the first who were finding early bugs in previous releases.
Reading individual papers of C++ WG, you can find the following one:
|
N3174
|
10-0164
|
To move or not to move
|
Bjarne Stroustrup
|
2010-10-17
|
2010-10
|
|
Core
|
There, Bjarne Stroustrup thinks about issues with implicitly generated copy and
move operations in C++.
It's always a pleasure to see how one can deal with a problem burdened with
antagonisms. To conduct his position Bjarne skilfully uses not only rational but
also emotional argumentation:
...We may deem this “bad code that deserves to be broken” or “unrealistic”, but
this example demonstrates that the problem with a generated move has an exact counterpart
for copy (which we have lived with for 27 years)...
...In 1984, I missed the chance to protect us against copy and we have lived with
the problems ever since. I should have instituted some rule along the lines “if
a class has a destructor, no copy operations are generated” or “if a class has a
pointer member, no copy operations are generated.”...
It's impossible to recall this numbers without shivering. :-)
We're following w3's "Bug
9069 - Function to invoke an XSLT transformation".
There, people argue about xpath API to invoke xslt transformations. Function should
look roughly like this:
transform
(
$node-tree as node()?,
$stylesheet as item(),
$parameters as XXX
) as node()
The discussion is spinning around the last argument: $parameters as
XXX. Should it be an xml element describing parameters, a function returning values for parameter names, or some new type modelling immutable
map?
What is most interesting in this discussion is the leak about plans to introduce
a map type:
Comment 7 Michael Kay, 2010-09-14 22:46:58 UTC
We're currently talking about adding an immutable map to XSLT as a new data
type (the put operation would return a new map). There appear to be a number of
possible efficient implementations. It would be ideally suited for this purpose,
because unlike the mechanism used for serialization parameters, the values can be
any data type (including nodes), not only strings.
There is a hope that map will finally appear in xslt!
See also:
Bug 5630
- [DM] Tuples and maps,
Tuples and maps - Status: CLOSED, WONTFIX,
Map, based on immutable trees,
Maps in exslt2?
In the
previous post we have announced an
API to parse a COBOL source into the cobolxom.
We exploited the
incremental parser to build a grammar xml tree and then were planning to
create an xslt transformation to generate
cobolxom.
Now, we would like to declare that such xslt is ready.
At present all standard COBOL constructs are supported, but more tests
are required. Preprocessor support is still in the todo list.
You may peek into an examples of
COBOL:
Cobol grammar:
And
cobolxom:
While we were building a grammar to cobolxom stylesheet we asked ourselves
whether the COBOL parsing could be done entirely in xslt. The answer is yes, so
who knows it might be that we shall turn this task into pure xslt one. :-)
Accidentally we have found that implementation of String and StringBuilder
have been considerably revised, while public interface has remained the
same.
public sealed class String
{
private int m_arrayLength;
private int m_stringLength;
private char
m_firstChar;
}
This layout is dated to .NET 1.0.
VM, in fact, allocates more memory than that defined in C# class, as
&m_firstChar refers to an inline char buffer.
This way string's buffer length and string's length were two different
values, thus StringBuilder used this fact and stored its content in a private string
which it modified in place.
In .NET 4, string is different:
public sealed class String
{
private int m_stringLength;
private char
m_firstChar;
}
Memory footprint of such structure is smaller, but string's length should
always be the same as its buffer. In fact layout of string is now the same as
layout of char[].
This modification leads to implementation redesign of the StringBuilder.
Earlier, StringBuilder looked like the following:
public sealed class StringBuilder
{
internal IntPtr m_currentThread;
internal int m_MaxCapacity;
internal volatile
string m_StringValue;
}
Notice that m_StringValue is used as a storage, and
m_currentThread is used to preserve thread affinity of the internal
string value.
Now, guys at Microsoft have decided to implement StringBuilder very differently:
public sealed class StringBuilder
{
internal int m_MaxCapacity;
internal int m_ChunkLength;
internal int m_ChunkOffset;
internal char[] m_ChunkChars;
internal StringBuilder m_ChunkPrevious;
}
Inspection of this layout immediately reveals implementation technique. It's a
list of chunks. Instance itself references the last chunk (most recently
appended), and the previous chunks.
Characteristics of this design are:
- while
Length is small, performance almost the same as it was earlier;
- there are no more thread affinity checks;
Append(), and ToString() works as fast a in the old version.
Insert() in the middle works faster, as only a chuck should be splitted and
probably reallocated (copied), instead of the whole string;
- Random access is fast at the end O(1) and slows when you approaching the start
O(chunk-count).
Personally, we would select a slightly different design:
public sealed class StringBuilder
{
private struct Chunk
{
public int length; // Chunk length.
public int offset; // Chunk offset.
public char[] buffer;
}
private int m_MaxCapacity;
// Alternatively, one can use
// private List<Chunk> chunks;
private int chunkCount; // Number of used chunks.
private Chunk[] chunks; // Array of chunks except last.
private Chunk last; // Last chunk.
private bool nonHomogenous; // false if all chunks are of the same size.
}
This design has better memory footprint, and random access time is O(1) when there were no
inserts in the middle (nonHomogenous=false), and
O(log(chunkCount)) after such inserts. All other characteristics are the
same.
We have run into another xslt bug, which depends on several independent
circumstances and often behaves differently being observed. That's clearly a
Heisenbug.
Xslt designers failed to realize that a syntactic suggar they introduce into
xpath can turn into obscure bugs. Well, it's easy to be wise afterwards...
To the point.
Consider you have a sequence consisting of text nodes and
elements, and now you want to "normalize" this sequence wrapping
adjacent text nodes into
separate elements. The following stylesheet is supposed to do the work:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:variable
name="nodes" as="node()*">
<xsl:text>Hello, </xsl:text>
<string value="World"/>
<xsl:text>! </xsl:text>
<xsl:text>Well, </xsl:text>
<string value="hello"/>
<xsl:text>, if not joking!</xsl:text>
</xsl:variable>
<result>
<xsl:sequence
select="t:normalize($nodes)"/>
</result>
</xsl:template>
<xsl:function
name="t:normalize" as="node()*">
<xsl:param name="nodes" as="node()*"/>
<xsl:for-each-group select="$nodes" group-starting-with="*">
<xsl:variable
name="string" as="element()?" select="self::string"/>
<xsl:variable name="texts"
as="node()*"
select="current-group() except $string"/>
<xsl:sequence
select="$string"/>
<xsl:if test="exists($texts)">
<string
value="{string-join($texts, '')}"/>
</xsl:if>
</xsl:for-each-group>
</xsl:function>
</xsl:stylesheet>
We're expecting the following output:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="! Well, "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
But often we're getting other results, like:
<result>
<string value="Hello, "/>
<string value="World"/>
<string value="Well, ! "/>
<string value="hello"/>
<string value=", if not joking!"/>
</result>
Such output may seriously confuse, unless you will recall the rule for the
xpath except operator:
The except operator takes two node sequences as operands and returns a sequence containing all the nodes that occur in the first operand but not in the second operand.
... these operators eliminate duplicate nodes from their result sequences based
on node identity. The resulting sequence is returned in document order..
...
The relative order of nodes in distinct trees is stable but implementation-dependent
These words mean that result sequence may be very different from original
sequence.
In contrast, if we change $text definition to:
<xsl:variable name="texts"
as="node()*"
select="current-group()[not(. is $string)]"/>
then the result becomes stable, but less clear.
See also
Xslt Heisenbug
It does not matter that DataBindExtender looks not usual in the ASP.NET. It turns to be so handy that built-in data binding is not considered to be an option.
After a short try, you uderstand that people tried very hard and have invented many controls and methods like ObjectDataSource, FormView, Eval(), and Bind() with outcome, which is very specific and limited.
In contrast DataBindExtender performs:
- Two or one way data binding of any business data property to any control property;
- Converts value before it's passed to the control, or into the business data;
- Validates the value.
See an example:
<asp:TextBox id=Field8 EnableViewState="false" runat="server"></asp:TextBox> <bphx:DataBindExtender runat='server' EnableViewState='false' TargetControlID='Field8' ControlProperty='Text' DataSource='<%# Import.ClearingMemberFirm %>' DataMember='Id' Converter='<%# Converters.AsString("XXXXX", false) %>' Validator='<%# (extender, value) => Functions.CheckID(value as string) %>'/>
Here, we beside a regualar two way data binding of a property Import.ClearingMemberFirm.Id to a property Field8.Text, format (parse) Converters.AsString("XXXXX", false), and finally validate an input value with a lambda function (extender, value) => Functions.CheckID(value as string).
DataBindExtender works also well in template controls like asp:Repeater, asp:GridView, and so on. Having your business data available, you may reduce a size of the ViewState with EnableViewState='false'. This way DataBindExtender approaches page development to a pattern called MVC.
Recently, we have found that it's also useful to have a way to run a javascript during the page load (e.g. you want to attach some client side event, or register a component). DataBindExtender provides this with OnClientInit property, which is a javascript to run on a client, where this refers to a DOM element:
... OnClientInit='$addHandler(this, "change", function() { handleEvent(event, "Field8"); } );'/>
allows us to attach onchange javascript event to the asp:TextBox.
So, meantime we're very satisfied with what we can achieve with DataBindExtender. It's more than JSF allows, and much more stronger and neater to what ASP.NET has provided.
The sources can be found at DataBindExtender.cs
Lately, we have found that we've accustomed to declare C#'s local variables using var:
var exitStateName = exitState == null ? "" : exitState.Name;
var rules = Environment.NavigationRules;
var rule = rules[caller.Name];
var flow = rule.NavigationCases[procedure.OriginExitState];
This makes code cleaner, and in presense of good IDE still allows to figure out
types very easely.
We, howerer, found that var tends to have exceptions in its
uses. E.g. for some reason most of boolean locals in our code tend to remain explicit
(matter of taste?):
bool succeed = false;
try
{
...
succeed = true;
}
finally
{
if (!succeed)
{
...
}
}
Also, type often survives in for, but not in foreach:
for(int i = 0; i < sourceDataMapping.Length;
++i)
{
...
}
foreach(var property in properties)
{
...
}
In addition var has some limitations, as one cannot easily
initialize such local with null. From the following we prefer the first approach:
IWindowContext context = null;
var context = (IWindowContext)null;
var context = null as IWindowContext;
var context = default(IWindowContext);
We might need to figure out a consistent code style as for var. It
might be like that:
- Numeric, booleans and string locals should use explicit type;
- Try to avoid locals initialized with null, or without initializer, or use type
if such variable cannot be avoided;
- Use var in all other cases.
Another code style could be like that:
- For the consistency, completely avoid the use of keyword
var.
Recently we were raising a question about serialization of ASPX output in xslt.
The question went like this:
What's the recommended way of ASPX page generation? E.g.:
------------------------ <%@ Page AutoEventWireup="true" CodeBehind="CurMainMenuP.aspx.cs" EnableSessionState="True" Inherits="Currency.CurMainMenuP" Language="C#" MaintainScrollPositionOnPostback="True" MasterPageFile="Screen.Master" %>
<asp:Content ID="Content1" runat="server" ContentPlaceHolderID="Title">CUR_MAIN_MENU_P</asp:Content>
<asp:Content ID="Content2" runat="server" ContentPlaceHolderID="Content"> <span id="id1222146581" runat="server" class="inputField system UpperCase" enableviewstate="false"> <%# Dialog.Global.TranCode %> </span> ... ------------------------
Notice aspx page directives, data binding expessions, and prefixed tag names without namespace declarations.
There was a whole range of expected answers. We, however, looked whether somebody have already dealed with the task and has a ready solution at hands.
In general it seems that xslt community is very angry about ASPX: both format and technology. Well, put this aside.
The task of producing ASPX, which is almost xml, is not solvable when you're staying with pure xml serializer. Xslt's xsl:character-map does not work at all. In fact it looks as a childish attempt to address the problem, as it does not support character escapes but only grabs characters and substitutes them with strings.
We have decided to create ASPX serializer API producing required output text. This way you use <xsl:output method="text"/> to generate ASPX pages.
With this goal in mind we have defined a little xml schema to describe ASPX irregularities in xml form. These are:
<xs:element name="declared-prefix"> - to describe known prefixes, which should not be declared;
<xs:element name="directive"> - to describe directives like <%@ Page %>;
<xs:element name="content"> - a transparent content wrapper;
<xs:element name="entity"> - to issue xml entity;
<xs:element name="expression"> - to describe aspx expression like <%# Eval("A") %>;
<xs:element name="attribute"> - to describe an attribute of the parent element.
This approach greately simplified for us an ASPX generation process.
The API includes:
In previous posts we were crying about problems with JSF to ASP.NET migration. Let's point to another one.
Consider that you have an input field, whose value should be validated:
<input type="text" runat="server" ID="id1222146409" maxlength="4"/> <bphx:DataBindExtender runat="server" TargetControlID="id1222146409" ControlProperty="Value" DataSource="<%# Import.AaControlAttributes %>" DataMember="UserEnteredTrancode"/>
Here we have an input control, whose value is bound to Import.AaControlAttributes.UserEnteredTrancode property. But what is missed is a value validation. Somewhere we have a function that could answer the question whether the value is valid. It should be called like this: Functions.IsTransactionCodeValid(value).
Staying within standard components we can use a custom validator on the page:
<asp:CustomValidator runat="server" ControlToValidate="id1222146409" OnServerValidate="ValidateTransaction" ErrorMessage="Invalid transaction code."/>
and add the following code-behind:
protected void ValidateTransaction(object source, ServerValidateEventArgs args) { args.IsValid = Functions.IsTransactionCodeValid(args.Value); }
This approach works, however it pollutes the code-behind with many very similar methods. The problem is that the validation rules in most cases are not property of page but one of data model. That's why page validation methods just forward check to somewhere.
While thinking on how to simplify the code we have came up with more conscious and short way to express validators, namely using lambda functions. To that end we have introduced a Validator property of type ValueValidator over DataBindExtender. Where
/// <summary>A delegate to validate values.</summary> /// <param name="extender">An extender instance.</param> /// <param name="value">A value to validate.</param> /// <returns>true for valid value, and false otherwise.</returns> public delegate bool ValueValidator(DataBindExtender extender, object value);
/// <summary>An optional data member validator.</summary> public virtual ValueValidator Validator { get; set; }
With this new property the page markup looks like this:
<input type="text" runat="server" ID="id1222146409" maxlength="4"/> <bphx:DataBindExtender runat="server" TargetControlID="id1222146409" ControlProperty="Value" DataSource="<%# Import.AaControlAttributes %>" DataMember="UserEnteredTrancode" Validator='<%# (extender, value) => Functions.IsTransactionCodeValid(value as string) %>' ErrorMessage="Invalid transaction code."/>
This is almost like an event handler, however it allowed us to call data model validation logic without unnecessary code-behind.
The updated DataBindExtender can be found at DataBindExtender.cs.
Being well behind of the latest news and traps of the ASP.NET, we're readily falling on each problem. :-)
This time it's a script injection during data binding.
In JSF there is a component to output data called h:outputText. Its use is like this:
<span jsfc="h:outputText" value="#{myBean.myProperty}"/>
The output is a span element with data bound value embeded into content. The natural alternative in ASP.NET seems to be an asp:Label control:
<asp:Label runat="server" Text="<%# Eval("MyProperty") %>"/>
This almost works except that the h:outputText escapes data (you may override this and specify attribute escape="false"), and asp:Label never escapes the data.
This looks as a very serious omission in ASP.NET (in fact very close to a security hole). What are chances that when you're creating a new page, which uses data binding, you will not forget to fix code that wizard created for you and to change it to:
<asp:Label runat="server" Text="<%# Server.HtmlEncode(Eval("MyProperty")) %>"/>
Eh? Think what will happen if MyProperty will return a text that looks like a script (e.g.: <script>alert(1)</script>), while you just wanted to output a label?
To address the issue we've also introduced a property Escape into DataBindExtender. So at present we have a code like this:
<asp:Label runat="server" ID="MyLabel"/> <bphx:DataBindExtender runat="server" TargetControlID="MyLabel" ControlProperty="Text" ReadOnly="true" Escape="true" DataSource="<%# MyBean %>" DataMember="MyProperty"/>
See also: A DataBindExtender, Experience of JSF to ASP.NET migration
After struggling with ASP.NET data binding we found no other way but to introduce our little extender control to address the issue.
We were trying to be minimalistic and to introduce two way data binding and to support data conversion. This way extender control (called DataBindExtender) have following page syntax:
<asp:TextBox id=TextBox1 runat="server"></asp:TextBox> <cc1:DataBindExtender runat="server" DataSource="<%# Data %>" DataMember="ID" TargetControlID="TextBox1" ControlProperty="Text" />
Two way data binding is provided with DataSource object (notice data binding over this property) and a DataMember property from the one side, and TargetControlID and ControlProperty from the other side. DataBindExtender supports Converter property of type TypeConverter to support custom converters.
DataBindExtender is based on AjaxControlToolkit.ExtenderControlBase class and implements System.Web.UI.IValidator. ExtenderControlBase makes implementation of extenders extremely easy, while IValidator plugs natuarally into page validation (Validate method, Validators collections, ValidationSummary control).
The good point about extenders is that they are not visible in designer, while it exposes properties in extended control itself. The disadvantage is that it requires Ajax Control Toolkit, and also ScriptManager component of the page.
To simplify the use DataBindExtender gets data from control and puts the value into data source in Validate method, and puts data into control in OnPreRender method; thus no specific action is required to perform data binding.
Source for the DataBindExtender is DataBindExtender.cs.
We used to think that ASP.NET is a way too powerful than JSF. It might be still true, but not when you are accustomed to JSF and spoiled with its code practice...
Looking at both technologies from a greater distance, we now realize that they give almost the same level of comfort during development, but they are different. You can feel this after you were working for some time with one technology and now are to implement similar solution in opposite one. That is where we have found ourselves at present.
The funny thing is that we did expect some problems but in a different place. Indeed, both ASP.NET and JSF are means to define a page layout and to map input and output of business data. While with the presentation (controls, their compositions, masters, styles and so on) you can find more or less equal analogies, the differences of implementation of data binding is a kind of a pain.
We have found that data binding in ASP.NET is somewhat awkward. Its Eval and Bind is bearable in simple cases but almost unusable when you business data is less trivial, or if you have to apply custom data formatting.
In JSF, with its Expression Language, we can perform two way data binding for rather complex properties like ${data.items[index + 5].property}, or to create property adapters ${my:asSomething(data.bean, "property").Value}, or add standard or custom property converters. In contrast data binding in ASP.NET is limited to simple property path (no expressions are supported), neither custom formatters are supported (try to format number as a telephone number).
Things work well when you're designing ASP.NET application from scratch, as you naturally avoid pitfalls, however when you got existing business logic and need to expose it to the web, you have no other way but to write a lot of code behind just to smooth out the problems that ASP.NET exhibits.
Another solution would be to design something like extender control that would attach more proper data binding and formatting facilities to control properties. That would allow to make page definitions in more declarative way, like what we have now in JSF.
While porting a solution from JSF to ASP.NET we have seen an issue with synchronization of access to a data stored in a session from multiple requests.
Consider a case when you store a business object in a session.
Going through the request lifecycle we observe that this business object may be accessed at different stages: data binding, postback event handler, security filters, other.
Usually this business object is mutable and does not assume concurent access. Browsers, however, may easily issue multiple requests to the same session at the same time. In fact, such behaviour, is not even an exception, as browsers nowadays are often sending concurrent requests.
In the JSF we're using a sync object, which is part of business object itself; lock it and unlock at the begin and at the end of a request correspondingly. This works perfectly as JSF guarantees that:
- lock is released after it's acquired (we use request scope bean with
@PostConstruct and @PreDestroy annotations to lock and unlock);
- both lock and unlock take place in the same thread.
ASP.NET, in contrast, tries to be more asynchronous, and allows for different stages of request to take place in different threads. This could be seen indirectly in the documentation, which does not give any commitments in this regards, and with code inspection where you can see that request can begin in one thread, and a next stage can be queued for the execution into the other thread.
In addition, ASP.NET does not guarantee that if BeginRequest has been executed then EndRequest will also run.
The conclusion is that we should not use locks to synchronize access to the same session object, but rather try to invent other means to avoid data races.
Update msdn states:
Concurrent Requests and Session State
Access to ASP.NET session state is exclusive per session, which means that if two different users make concurrent requests, access to each separate session is granted concurrently. However, if two concurrent requests are made for the same session (by using the same SessionID value), the first request gets exclusive access to the session information. The second request executes only after the first request is finished. (The second session can also get access if the exclusive lock on the information is freed because the first request exceeds the lock time-out.)
This means that the required synchronization is already built into ASP.NET. That's good.
We have implemented report parser in C#. Bacause things are spinned around C#, a
schema definition is changed.
We have started from classes defining a report definition tree, annotated these
classes for xml serialization, and, finally, produced xml schema for such tree.
So, at present, it is not an xml schema with annotations but a separate xml
schema.
In addition we have defined APIs:
- to enumerate report data (having report definition and report data one can get
IEnumerable<ViewValue> to iterate report data in structured form);
- to read report through
XmlReader, which allows, for example, to
have report as input for an xslt tranformation.
- to write report directly into
XmlWriter.
An example of report definition as C# code is:
MyReport.cs. The very same report definition but serialized into xml is
my-report.xml. A generated xml schema for a report definition is:
schema0.xsd.
The good point about this solution is that it's already flexible enough to
describe every report layout we have at hands, and it's extendable. Our
measurments show that report parsing is extremely fast and have very small
memory footprint due to forward only nature of report definitions.
From the design point of view report definition is a view of original text data
with view info attached.
At present we have defined following views:
- Element - a named view to generate output from a content view;
- Content - a view to aggregate other views together;
- Choice - a view to produce output from one of content views;
- Sequence - a view to sequence input view by key expressions, and to attach an
index to each sequence item;
- Iterator - a view to generate output from input view while some condition is
true, and to attach an iteration index to each part of output view;
- Page - a view to remove page headers and footers in the input view, and to
attach an index to each page;
- Compute - a named view to produce result of evaluation of expression as output
view;
- Data - a named view to produce output value from some bounds of input view,
and optionally to convert, validate and format the value.
To specify details of definitions there are:
- expressions to deal with integers:
Add, Div,
Integer, MatchProperty, Max, Min,
Mod, Mul, Neg, Null,
Sub, VariableRef, ViewProperty, Case;
- conditions to deal with booleans:
And, EQ, GE,
GT, IsMatch, LE, LT,
NE, Not, Or.
At present there is no a specification of a report definitions. Probably, it's
the most complex part to create such a spec for a user without deep knowledge.
At present, our idea is that one should use xml schema (we should polish
generated schema) for the report definition and schema aware editor to build
report definitions. That's very robust approach working perfectly with
languages xom.
C# sources can be found at:
ReportLayout.zip including report definition classes and a sample report.
We're facing a task of parsing reports produced from legacy applications and
converting them into a structured form, e.g. into xml. These xml files can be
processed further with up to date tools to produce good looking reports.
Reports at hands are of very different structure and of size: from a couple of KB
to a several GB. The good part is that they mostly have a tabular form, so it's
easy to think of specific parsers in case of each report type.
Our goal is to create an environment where a less qualified person(s) could
create and manage such parsers, and only rarely to engage someone who will handle
less untrivial cases.
Our analysis has shown that it's possible to write such parser in almost any
language: xslt, C#, java.
Our approach was to create an xml schema annotations
that from one side define a data structure, and from the other map report
layout. Then we're able to create an xslt that will generate either xslt, C#, or
java parser according to the schema definitions. Because of
languages
xom, providing XML Object Model and serialization stylesheets for C# and
java, it does not really matter what we shall generate xslt or C#/java, as
code will look the same.
The approach we're going to use to describe reports is
not as powerfull as conventional parsers. Its virtue, however, is simplicity of
specification.
Consider a report sample (a data to extract is in bold):
1 TITLE ... PAGE: 1
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
XXX 065 065 CLR ...
YYY ...
...
1 TITLE ... PAGE: 2
BUSINESS DATE: 09/30/09 ... RUN DATE: 02/23/10
CYCLE : ITD RUN: 001 ... RUN TIME: 09:22:39
CM BUS ...
CO NBR FRM FUNC ...
----- ----- ----- -----
AAA NNN MMM PPP ...
BBB ...
...
* * * * * E N D O F R E P O R T * * * * *
We're approaching to the report through a sequence of views (filters) of this
report. Each veiw localizes some report data either for the subsequent
filterring or for the extraction of final data.
Looking into the example one can build following views of the report:
- View of data before the "E N D O F R E P O R T" line.
- View of remaining data without page headers and footers.
- Views of table rows.
- Views of cells.
A sequence of filters allows us to build a pipeline of transformations of
original text. This also allows us to generate a clean xslt, C# or java code
to parse the data.
At first, our favorite language for such parser was xslt.
Unfortunatelly, we're dealing with Saxon xslt implementation, which is not very
strong in streaming processing. Without a couple of extension functions to
prevent caching, it tends to cache whole input in the memory, which is not
acceptable.
At present we have decided to start from C# code, which is pure C# naturally.
:-)
Code still is in the development but at present we would like to share the xml
schema annotations describing report layout:
report-mapping.xsd, and a sample of report description:
test.xsd.
To see that the problem with Generator functions in xslt
is a bit more complicated compare two functions.
The first one is quoted from the earlier post:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It does not work in Saxon: crashes with out of memory.
The second one is slightly modified version of the same function:
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value + 0"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
It's working without problems. In first case Saxon decides to cache all
function's output, in the second case it decides to evaluate data lazily on
demand.
It seems that optimization algorithms implemented in Saxon are so plentiful and
complex that at times they fool one another. :-)
See also:
Generator functions
There are some complications with streamed tree that we have implemented in
saxon. They are due to the fact that only a view of input data is available at
any time. Whenever you access some element that's is not available you're
getting an exception.
Consider an example. We have a log created with java logging. It looks like
this:
<log>
<record>
<date>...</date>
<millis>...</millis>
<sequence>...</sequence>
<logger>...</logger>
<level>INFO</level>
<class>...</class>
<method>...</method>
<thread>...</thread>
<message>...</message>
</record>
<record>
...
</record>
...
We would like to write an xslt that returns a page
of log as html:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt/this"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs t">
<xsl:param name="start-page" as="xs:integer" select="1"/>
<xsl:param name="page-size" as="xs:integer" select="50"/>
<xsl:output method="xhtml" byte-order-mark="yes" indent="yes"/>
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<xsl:variable name="records" as="element()*"
select="subsequence(record, $start, $page-size)"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:apply-templates mode="t:record" select="$records"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:template mode="t:record" match="record">
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:template>
</xsl:stylesheet>
This code does not work. Guess why? Yes, it's subsequence(), which is too greedy.
It always wants to know what's the next node, so it naturally skips a content of
the current node. Algorithmically, such saxon code could be rewritten, and could possibly work
better also in modes other than streaming.
A viable workaround, which does not use subsequence, looks rather untrivial:
<!-- Entry point. -->
<xsl:template match="/log">
<xsl:variable name="start" as="xs:integer"
select="($start-page - 1) * $page-size + 1"/>
<xsl:variable name="end" as="xs:integer"
select="$start + $page-size"/>
<html>
<head>
<title>
<xsl:text>A log file. Page: </xsl:text>
<xsl:value-of select="$start-page"/>
</title>
</head>
<body>
<table border="1">
<thead>
<tr>
<th>Level</th>
<th>Message</th>
</tr>
</thead>
<tbody>
<xsl:sequence select="
t:generate-records(record, $start, $end, ())"/>
</tbody>
</table>
</body>
</html>
</xsl:template>
<xsl:function name="t:generate-records" as="element()*">
<xsl:param name="records" as="element()*"/>
<xsl:param name="start" as="xs:integer"/>
<xsl:param name="end" as="xs:integer?"/>
<xsl:param name="result" as="element()*"/>
<xsl:variable name="record" as="element()?" select="$records[$start]"/>
<xsl:choose>
<xsl:when test="(exists($end) and ($start > $end)) or empty($record)">
<xsl:sequence select="$result"/>
</xsl:when>
<xsl:otherwise>
<!-- Make a copy of record to avoid streaming access problems. -->
<xsl:variable name="log">
<xsl:copy-of select="$record"/>
</xsl:variable>
<xsl:variable name="level" as="xs:string"
select="$log/record/level"/>
<xsl:variable name="message" as="xs:string"
select="$log/record/message"/>
<xsl:variable name="next-result" as="element()*">
<tr>
<td>
<xsl:value-of select="$level"/>
</td>
<td>
<xsl:value-of select="$message"/>
</td>
</tr>
</xsl:variable>
<xsl:sequence select="
t:generate-records
(
$records,
$start + 1,
$end,
($result, $next-result)
)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function>
Here we observed the greediness of saxon, which too early tried to consume
more input than it's required. In the other cases we have seen that it may
defer actual data access to the point when there is no data anymore.
So, without tuning internal saxon logic it's possible but not easy to write
stylesheets that exploit streaming features.
P.S. Updated sources are at
streamedtree.zip
At some point we needed to have an array with volatile elements in java.
We knew that such beast is not found in the java world. So we searched
the Internet and found the answers that are so wrong, and introduce so obscure
threading bugs that the guys who provided them would better hide them and run immediately to fix their
buggy programs...
The first one is
Volatile arrays
in Java. They suggest such solution:
volatile int[] arr = new int[...];
...
arr[4] = 100;
arr = arr;
The number two:
What Volatile Means in Java
A guy assures that this code works:
Fields:
int answer = 0;
volatile boolean ready = false;
Thread1:
answer = 42;
ready = true;
Thread2:
if (ready)
{
print(answer);
}
They are very wrong! Non volatile access can be reordered by the implementation.
See Java's
Threads and Locks:
The rules for volatile variables effectively require that main memory be touched exactly once for each use or assign of a volatile variable by a thread, and that main memory be touched in exactly the order dictated by the thread execution semantics. However, such memory actions are not ordered with respect to read and write actions on nonvolatile variables.
They probably thought of locks when they argued about volatiles:
a lock action acts as if it flushes all variables from the
thread's working memory; before use they must be assigned or loaded from main
memory.
P.S. They would better recommend
AtomicReferenceArray.
When time has come to process big xml log files we've decided to implement
streamable tree in saxon the very same way it was implemented in .net eight
years ago (see
How would we approach to streaming facility in xslt).
It's interesting enough that the implementation is similar to one of
composable tree. There a node never stores a reference to a parent, while in
the streamed tree no references to children are stored. This way only a limited
subview of tree is available at any time. Implementation does not support
preceding and preceding-sibling axes. Also, one cannot navigate to a node that
is out of scope.
Implementation is external (there are no changes to saxon itself). To use it one
needs to create an instance of DocumentInfo, which pulls data from
XMLStreamReader, and
to pass it as an input to a transformation:
Controller controller =
(Controller)transformer;
XMLInputFactory factory = XMLInputFactory.newInstance();
StreamSource inputSource = new StreamSource(new File(input));
XMLStreamReader reader = factory.createXMLStreamReader(inputSource);
StaxBridge bridge = new StaxBridge();
bridge.setPipelineConfiguration(
controller.makePipelineConfiguration());
bridge.setXMLStreamReader(reader);
inputSource = new DocumentImpl(bridge);
transformer.transform(inputSource, new StreamResult(output));
This helped us to format an xml log file of arbitrary size. An xslt like this can do the work:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns="http://www.w3.org/1999/xhtml"
exclude-result-prefixes="xs">
<xsl:template match="/log">
<html>
<head>
<title>Log</title>
</head>
<body>
<xsl:apply-templates/>
</body>
</html>
</xsl:template>
<xsl:template match="message">
...
</xsl:template>
<xsl:template match="message[@error]">
...
</xsl:template>
...
</xsl:stylesheet>
Implementation can be found at:
streamedtree.zip
By the generator we assume a function that produces an infinitive output
sequence for a particular input.
That's a rather theoretical question, as xslt does not allow infinitive
sequence, but look at the example:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:t="http://www.nesterovsky-bros.com/xslt"
exclude-result-prefixes="xs t">
<xsl:template match="/">
<xsl:variable name="value" as="xs:string" select="'10100101'"/>
<xsl:variable name="values" as="xs:integer+"
select="t:generate(1)"/>
<!--<xsl:variable name="values" as="xs:integer+">
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="1"/>
</xsl:call-template>
</xsl:variable>-->
<xsl:variable name="integer" as="xs:integer" select="
sum
(
for $index in 1 to string-length($value)
return
$values[$index][substring($value, $index, 1) = '1']
)"/>
<xsl:message select="$integer"/>
</xsl:template>
<xsl:function name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:sequence select="t:generate($value * 2)"/>
</xsl:function>
<!--<xsl:template name="t:generate" as="xs:integer*">
<xsl:param name="value" as="xs:integer"/>
<xsl:sequence select="$value"/>
<xsl:call-template name="t:generate">
<xsl:with-param name="value" select="$value * 2"/>
</xsl:call-template>
</xsl:template>-->
</xsl:stylesheet>
Here the logic uses such a generator and decides by itself where to break.
Should such code be valid?
From the algorithmic perspective example would better to work, as separation of
generator logic and its use are two different things.
Lately, after playing a little with saxon tree models, we thought that design
would be more cleaner and implementation faster if NamePool were implemented
differently.
Now, saxon is very pessimistic about java objects thus it prefers to encode
qualified names with integers. The encoding and decoding is done in the
NamePool. Other parts of code use these integer values.
Operations done over these integers are:
- equality comparision of two such integers in order to check whether to
qualified or extended names are equal;
- get different parts of qualified name from
NamePool.
We would design this differently. We would:
- create a
QualifiedName class to store all name parts.
- declare
NamePool to create and cache QualifiedName instances.
This way:
- equality comparision would be a reference comparision of two instances;
- different parts of qualified name would become a trivial getter;
- contention of such name pool would be lower.
That's the implementation we would propose:
QualifiedName.java,
NameCache.java
Earlier, in the entry "Inline functions in xslt 2.1" we've described an implementation of xml tree model that may share subtrees among different trees.
This way, in a code:
<xsl:variable name="elements" as="element()*" select="..."/>
<xsl:variable name="result" as="element()">
<result>
<xsl:sequence select="$elements"/>
</result>
</xsl:variable>
the implementation shares internal representation among $elements
and subtree of $result. From the perspective of xslt it
looks as completely different subtrees with different node identities, which is
in the accordance with its view of the world.
After a short study we've decided to create a research implementation of this
tree model in saxon. It's took only a couple of days to introduce a minimal changes
to engine, to refactor linked tree into a new composable tree, and to
perform some tests.
In many cases saxon has benefited immediately from this new tree model, in some
other cases more tunings are required.
Our tests've showed that this new tree performed better than
linked tree, but a little bit worser than tiny tree. On the other hand, it's
obvious that conventional code patterns avoid subtree copying, assuming it's
expensive operation, thus one should rethink some code practices to benefit from
composable tree.
Implementation can be downloaded at:
saxon.composabletree.zip
From the web we know that xslt WG is thinking now on how to make xslt more friendly to a huge documents. They will probably introduce some xslt syntax to allow implementation to be ready for a such processing.
They will probably introduce an indicator marking a specific mode for streaming. XPath in this mode will probably be restricted to a some its subset.
The funny part is that we have implemented similar feature back in 2002 in .net. It was called XPathForwardOnlyNavigator.
Implementation stored only several nodes at a time (context node and its ancestors), and read data from XmlReader perforce. Thus one could navigate to ancestor elements, to children, and to the following siblings, but never to any previous node. When one tried to reach a node that was already not available we threw an exception.
It was simple, not perfect (too restrictive) but it was pluggable in .net's xslt, and allowed to process files of any size.
That old implementation looks very attractive even now in 2010. We expect that WG with their decisions will not rule out such or similar solutions, and will not force implementers to write alternative engine for xslt streaming.
Xslt 1.0 has been designed based on the best intentions. Xslt 2.0 got a legacy
baggage.
If you're not entirely concentrated during translation of your algorithms into
xslt 2.0 you can get into trap, as we did.
Consider a code snapshot:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+"
select="$elements/t:convert(.)"/>
Looks simple, isn't it?
Our intention was to get converted elements, which result from some
xsl:apply-templates logic.
Well, this code works... but rather sporadically, as results are often in wrong
order! This bug is very close to
what is called a
Heisenbug.
So, where is the problem?
Elementary, my dear Watson:
xsl:apply-templates constructs a sequence of rootless
elements.
$elements/t:convert(.) converts elements and orders them in document order.
Here is a tricky part:
The relative order of nodes in distinct trees is stable but
implementation-dependent...
Clearly each rootless element belongs to a unique tree.
After that we have realized what the problem is, code has been
immediately rewritten as:
<xsl:variable name="elements" as="element()+">
<xsl:apply-templates/>
</xsl:variable>
<xsl:variable name="converted-elements" as="element()+" select="
for $element in $elements return
t:convert($element)"/>
P.S. Taking into an accout a size of our xslt code base, it took a half an hour to
localize the problem. Now, we're at position to review all uses of slashes
in xslt. As you like it?
Inline functions in xslt 2.1 look often as a some strange aberration. Sure,
there are very usefull cases when they are delegates of program logic
(e.g. comparators, and filters), but often (probably more often) we can see that
it's use is to model data structures.
As an example, suppose you want to model a structure with three properties say
a, b, and c. You implement this creating functions that wrap and unwrap the
data:
function make-data($a as item(), $b as item(), $c
as item()) as function() as item()+
{
function() { $a, $b, $c }
}
function a($data as function() as item()+) as item()
{
$data()[1]
}
function b($data as function() as item()+) as item()
{
$data()[2]
}
function c($data as function() as item()+) as item()
{
$data()[3]
}
Clever?
Sure, it is! Here, we have modeled structrue with the help of sequence, which we
have wrapped into a function item.
Alas, clever is not always good (often it's a sign of a bad). We just wanted to
define a simple structure. What it has in common with function?
There is a distance between what we want to express, designing an algorithm, and
what we see looking at the code. The greater the distance, the more efforts are
required to write, and to read the code.
It would be so good to have simpler way to express such concept as a structure.
Let's dream a little. Suppose you already have a structure, and just want to
access its members. An idea we can think about is an xpath like access method:
$data/a, $data/b, $data/c
But wait a second, doesn't
$data looks very like an xml element, and its accessors are just
node tests?
That's correct, so data constructor may coincide with element constructor.
Then what pros and cons of using of xml elements to model structures?
Pros are: existing xml type system, and sensibly looking code (you just
understand that here we're constructing a structure).
Cons are: xml trees are implemented the way that does not assume fast (from the
perfromace perspective) composition, as when you construct a structure a copy of
data is made.
But here we observe that "implemented" is very important word in this context.
If xml tree implementation would not store reference to the parent node then
subtrees could be composed very efficiently (note that tree is assumed to be
immutable). Parent node could be available through a tree navigator, which would
contain reference to a node itself and to a parent tree navigator (or to store child parent map somewhere near the root).
Such tree structure would probably help not only in this particular case but
also with other conventional xslt code patterns.
P.S. Saxon probably could implement its NodeInfo this way.
Update: see also Custom tree model.
A while ago we have proposed to introduce maps as built-in types in
xpath/xquery type system: Tuples and maps.
The suggestion has been declined (probably our arguments were not convincing). We,
however, still think that maps along with sequences are primitives, required to
build sensible (not obscure one) algorithms. To see that map is at times is the
best way to resolve the problems we shall refer to an utility function to
allocate names in scope. Its signature looks like this:
<!--
Allocates unique names in the form $prefix{number}?.
Note that prefixes may coincide.
$prefixes - a name prefixes.
$names - allocated names pool.
$name-max-length - a longest allowable name length.
Returns unique names.
-->
<xsl:function name="t:allocate-names" as="xs:string*">
<xsl:param name="prefixes" as="xs:string*"/>
<xsl:param name="names" as="xs:string*"/>
<xsl:param name="name-max-length" as="xs:integer?"/>
Just try to program it and you will find yourselves coding something like one defined at
cobolxom.
To be efficient such maps should provide, at worst, a logarithmic operation
complexity:
- Access to the map through a key (and/or by index) - complexity is LnN;
- Creation of a new map with added or removed item - complexity is LnN;
- Construction of the map from ordered items - complexity is O(N);
- Construction of the map from unordered items - complexity is O(N*LnN);
- Total enumeration of all items - complexity is O(N*LnN);
These performance metrics are found in many functional and procedural
implementations of the maps. Typical RB and AVL tree based maps satisfy these
restrictions.
What we suggest is to introduce map implementation into the exslt2 (assuming
inline functions are present). As a sketch we have implemented pure
AVL Tree in Xslt 2.0:
We do not assume that implementation like this should be used, but rather think
of some extension function(s) that provides a better performance.
What do you think?
The story about immutable tree would not be complete without xslt
implementation. To make it possible one needs something to approxomate tree
nodes. You cannot implement such consruct efficiently in pure xslt 2.0 (it would
be either unefficient or not pure).
To isolate the problem we have used tuple interface:
tuple:ref($items as item()*) as item() - to wrap items into a tuple;
tuple:deref($tuple as item()?) as item()* - to unwrap items from a tuple;
tuple:is-same($first as item(), $second as item()) as xs:boolean
- to test
whether two tuples are the same.
and defined inefficient implementation based on xml elements. Every other part
of code is a regular AVL algorithm implementation.
We want to stress again that even assuming that there is a good tuple
implementation we would prefer built-in associative container implementation.
Why the heck you need to include about 1000 lines of code just to use a map?
Source code is:
We like Visual Studio very much, and try to adopt new version earlier.
For the last time our VS's use pattern is centered around xml and xslt. In our opinion VS 2008 is the best xslt 2 editor we have ever seen even with lack of support of xslt 2.0 debugging.
Unfortunatelly, that is still a true when VS 2010 is almost out. VS 2008 is just 2 - 3 times faster. You can observe this working with xslt files like those in languages-xom.zip (1000 - 2000 rows). Things just become slow.
We still hope that VS 2010 will make a final effort to outdo what VS 2008 has already delivered.
Why do we return to this theme again?
Well, it's itching!
In cobolxom there is an utility function to allocate names in scope. Its signature looks like this:
<!-- Allocates unique names in the form $prefix{number}?. Note that prefixes may coincide. $prefixes - a name prefixes. $names - allocated names pool. $name-max-length - a longest allowable name length. Returns unique names. --> <xsl:function name="t:allocate-names" as="xs:string*"> <xsl:param name="prefixes" as="xs:string*"/> <xsl:param name="names" as="xs:string*"/> <xsl:param name="name-max-length" as="xs:integer?"/>
We have created several different implementations (all use recursion). Every implementation works fair for relatively small input sequences, say N < 100, but we have cases when there are about 10000 items on input. Algorithm's worst case complexity, in absence of associative containers, is O(N*N), and be sure it's an O-o-o-oh... due to xslt engine implementation.
If there were associative containers with efficient access (complexity is O(LogN)), and construction of updated container (complexity is also O(LogN)) then implementation would be straightforward and had complexity O(N*LogN).
Given:
public class N
{
public readonly N next;
}
What needs to be done to construct a ring of N: n1 refers to n2, n2 to n3, ... nk to n1? Is it possible?
To end with immutable trees, at least for now, we've implemented IDictionary<K, V>.
It's named Map<K, V>. Functionally it looks very like SortedDictionary<K, V>.
there are some differences, however:
Map in contrast to SortedDictionary is very cheap on
copy.
- Bacause
Map is based on AVL tree, which is more rigorly balanced
than RB tree, so it's a little bit faster asymptotically for lookup than SortedDictionary,
and a little bit slower on modification.
- Due to the storage structure: node + navigator,
Map consumes less memory than
SortedDictionary, and is probably cheaper for GC (simple garbage
graphs).
- As AVL tree stores left and right subtree sizes, in contrast to a "color" in
RB tree, we able to index data in two ways: with integer index, and with key
value.
Sources are:
Update:
It was impossible to withstand temptation to commit some primitive performance
comparision. Map outperforms SortedDictionary both in population and in access.
this does not aggree with pure algorithm's theory, but there might be other
unaccounted factors: memory consumption, quality of implementation, and so on.
Program.cs is updated with measurements.
Update 2:
More occurate tests show that for some key types Map's faster, for others
SortedDictionary's faster. Usually Map's slower during population (mutable AVL
tree navigator may fix this). the odd thing is that Map<string, int> is faster
than SortedDictionary<string, int> both for allocaction and for access. See
excel report.
Update 3:
Interesing observation. The following table shows maximal and
average tree heights for different node sizes in AVL and RB trees after a random population:
|
AVL |
RB |
| Size |
Max |
Avg |
Max |
Avg |
| 10 |
4 |
2.90 |
5 |
3.00 |
| 50 |
7 |
4.94 |
8 |
4.94 |
| 100 |
8 |
5.84 |
9 |
5.86 |
| 500 |
11 |
8.14 |
14 |
8.39 |
| 1000 |
12 |
9.14 |
16 |
9.38 |
| 5000 |
15 |
11.51 |
18 |
11.47 |
| 10000 |
16 |
12.53 |
20 |
12.47 |
| 50000 |
19 |
14.89 |
23 |
14.72 |
| 100000 |
20 |
15.90 |
25 |
15.72 |
| 500000 |
25 |
18.26 |
28 |
18.27 |
| 1000000 |
25 |
19.28 |
30 |
19.27 |
Here, according with theory, the height of AVL tree is shorter than the height
of RB tree. But what is most interesting is that the depth of an "average
node". This value describes a number of steps required to find a random key. RB
tree is very close and often is better than AVL in this regard.
It was obvious as hell from day one of generics that there will appear obscure
long names when you will start to parametrize your types. It was the easiest
thing in the world to take care of this in advanvce. Alas, C# inherits C++'s bad
practices.
Read Associative containers in a functional languages
and
Program.cs to see what we're talking about.
Briefly, there is a pair (string, int), which in C# should be declared as:
System.Collections.Generic.KeyValuePair<string, int>
Obviously we would like to write it in a short way. These are our attempts, which
fail:
1. Introduce generic alias Pair<K, V>:
using System.Collections.Generic;
using Pair<K, V> = KeyValuePair<K, V>;
2. Introduce type alias for a generic type with specific types.
using System.Collections.Generic;
using Pair = KeyValuePair<string, int>;
And this is only one that works:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
Do you think is it bearable? Well, consider the following:
- There is a generic type
ValueNode<T>, where T
should be Pair.
- There is a generic type
TreeNavigator<N>, where N is should be ValueNode<Pair>.
The declaration looks like this:
using Pair = System.Collections.Generic.KeyValuePair<string, int>;
using Node = NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>;
using Navigator = NesterovskyBros.Collections.AVL.TreeNavigator<
NesterovskyBros.Collections.AVL.ValueNode<
System.Collections.Generic.KeyValuePair<string, int>>>;
Do you still think is it acceptable?
P.S. Legacy thinking led C#'s and java's designers to the use of word "new" for the
object construction. It is not required at all. Consider new Pair("A", 1) vs Pair("A", 1).
C++ prefers second form. C# and java always use the first one.
Continuing with the post "Ongoing xslt/xquery spec update"
we would like to articulate what options regarding associative containers do we
have in a functional languages (e.g. xslt, xquery), assuming that variables are
immutable and implementation is efficient (in some sense).
There are three common implementation techniques:
- store data (keys, value pairs) in sorted array, and use binary search to
access values by a key;
- store data in a hash map;
- store data in a binary tree (usually RB or AVL trees).
Implementation choice considerably depends on operations, which are taken over
the container. Usually these are:
- construction;
- value lookup by key;
- key enumeration (ordered or not);
- container modification (add and remove data into the
container);
- access elements by index;
Note that modification in a functional programming means a creation of a new
container, so here is a
division:
- If container's use pattern does not include modification, then probably the
simplest solution is to build it as an ordered sequence of
pairs, and use binary search to access the data. Alternatively, one could
implement associative container as a hash map.
- If modification is essential then neither ordered sequence of pairs, hash map
nor classical tree implementation can be used, as they are either too slow
or too greedy for a memory, either during modification or during access.
On the other hand to deal with container's modifications one can build
an implementation, which uses "top-down" RB
or AVL trees. To see the
difference consider a classical tree structure and its functional variant:
|
Classical |
Functional |
| Node structure: |
node
parent
left
right
other data |
node
left
right
other data |
| Node reference: |
node itself |
node path from a root of a tree |
| Modification: |
either mutable or requires a completely new tree |
O(LnN) nodes are created
|
Here we observe that:
- one can implement efficient map (lookup time no worse than O(LnN)) with no
modification support, using ordered array;
- one can implement efficient map with support of modification, using immutable binary tree;
- one can implement all these algorithms purely in xslt and xquery (provided that inline
functions are supported);
- any such imlementation will lose against the same implementation
written in C++, C#, java;
- the best implementation would probably start from sorted array and
will switch to binary tree after some size threshold.
Here we provide a C# implementation of a functional AVL tree, which also supports
element indexing:
Our intention was to show that the usual algorithms for associative
containers apply in functional
programming; thus a feature complete functional language must support
associative containers to make development more conscious, and to free a
developer from inventing basic things existing already for almost a half of
century.
Several years ago we have started a new project. We do not like neither hate
any particular language, thus the decision what language to use was pragmatical:
xslt 2.0 fitted perfectly.
At present it's a solid volume of xslt code. It exhibits all the virtues of any
other good project in other language: clean design, modularity, documentation,
sophisticationless (good code should not be clever).
Runtime profile of the project is that it deals with xml documents with sizes
from a few dozens of bytes to several megabytes, and with xml schemas from
simple ones like a tabular data, and to rich like xhtml and untyped. Pipeline
of stylesheets processes gigabytes of data stored in the database and in files.
All the bragging above is needed here to introduce the context for the following
couple of lines.
The diversity of load conditions and a big code base, exposed xslt engine of
choice to a good proof test. The victim is Saxon. In the course of project we
have found and reported many bugs. Some of them are nasty and important, and
others are bearable. To Michael Kay's credit (he's owner of Saxon) all bugs are being
fixed promtly (see
the last one).
Such project helps us to understand a weak sides of xslt (it seems sometimes they, in WG, lack such experience, which should lead them through).
Well, it has happened so that we're helping to Saxon project. Unintentionally,
however! :-)
P.S.
About language preferences.
Nowdays we're polishing a
COBOL
generation. To this end we have familiarized ourselves with this language.
That's the beatiful language. Its straightforwardness helps to see the evolution
of computer languages and to understand what and why today's languages try to
timidly hide.
In spite of the fact that our last projects are being developed in Java, the .NET is definitly our favorite platform.
In a twitter I saw the phrase: "Java the language is a stagnant mess". It's said in favour of C#. It's true that C# significantly affects now even on Java (let's remember generics, jaxb, web services, etc.), but in my opinion, the C# won't be the leading language for worldwide enterprise applications in the nearest future.
One of causes is that the main platform for .NET still is Windows. The situation could be changed by Mono project, but I think there are yet not enough projects on platforms other than Windows.
My guess is confirmed by some of observations that I did as a software engineer of an IT company. Our company performs different software porting projects from legacy programming languages like COBOL, ADSO, Natural etc. into up to date languages like Java, C# etc. It worth to say that clients rarely select to migrate to .NET despite to our advices.
The main reason of such choice, according to most of our clients, is that they want to be platform independent and only Java gives them this choice.
It worth for Microsoft to think about cooperation with Mono in order to make .NET really platform indpendent, otherwise C# will always be a step behind Java despite apparent advantages of C# as a programming language.
Suppose you have a library, which is out in the production and is used by many clients. At the same time the library evolves: API is extended, bugs are being fixed, code becomes faster and cleaner, bla bla bla...
At some point you're fixing some important bug that's been hiding for a long time in the bowels of your library. You're happy that you've spotted it before clients got into troubles. You're notifying all the clients that there is an important fix, and that they need to update the library.
What do you think you hear in return?
Well, we're not perfect, there are bugs in our software. We and our clients realize this. Nothing will eliminate bugs to creep into a code from time to time.
That's a train of thought of a particular client:
We agree that there is a bug and that it has to be fixed. We, however, want to touch the library in a minimal way, as who knows what other new bugs had they introduced, so let's ask them to fix this particular bug in our version of the library.
That's fair from the client's perspective. They don't want better code, they just want that particular bug fixed!
For us, however, this means branching some old version of the library, fixing bug and supporting this branch for the particular client. It's fair to expect similar position from each client, thus should we create and support library branches per client, and branch a main branch for a new client only?
For us (Arthur and Vladimir) it looks as enormous waste of resources. We (our company) should either hire more and more scaled people or experience gradual slowdown of support and development.
Our answer could be obvious if not position of top managers who value client relations so much that they easily promise whatever client wishes. No arguments that latest version is better tested, more conforming to specifications, more reliable, faster and so on are accepted. The main argument against our position is that the client's applications run in the production, and no new potential bugs are acceptable.
Here is our dilemma: we neither can convince the client (more precisely our managers) that we're right, nor are convinced with their arguments...
From time to time I like to tell, to express (or even to brag about) what keeps our minds busy. Sometimes it's even useful, and you unexpectedly discover, you are on the edge of community interests, other time you are outsider.
So, what are we doing now? Well, already several years we are working in Multiconn. One of goals of our company is to help to expose mainframe functionality. Many different solutions of this task are available, and even Multiconn provides several different approaches. Now go right to our business - web service as delegate of mainframe.
To be honest, I think mainframes are legacy monsters, dinosaurs. I realize, it's only a bad perspective, but the first experience is often very strong. For us as developers mainframe is a source of data in different formats (sure formats are legacy also): COBOL records, terminal messages, and so on. Our idea is to allow to mainframe to rest in peace and to gracefully consume its data. Combining consumed data in xml form and operation bindings that describe mainframe's application flow, we arrive to a view of mainframe application as a web service.
To achieve this goal we have worked up extension to xml schema that allows mapping schema elements to a data. It should be pointed, it's perfectly legal to extend xml schema. One way to extend it, is to use elements in custom namespace in appInfo element.
The following was to create importer of schema with our extensions. It's somehow similar to xsd.exe tool. Our tool generates classes with annotations XmlXxxAttribute for xml serialization and LayoutXxxAttribute (this is our custom attributes) for instance serializing and deserializing to and from a mainframe data.
The next stage was to create data serializer. This is a counterpart to XmlSerializer. It inspects class meta-data and creates plan for serialization and deserialization.
On the next stage we have worked up wsdl schema bindings for our technology.
After that we have created tool to import wsdl (similar to wsdl.exe) that generates web service which passes input messages to communication layer that serializes and forwards data to mainframe, accepts and deserializes response and returns it to the web service, which in its turn returns result to a client.
The next was communication layer that interacts with mainframe. This layer consists of abstract (general) sublayer and specializations which support different patterns of communications.
As result we have web service implemented in .NET. This web service represents some mainframe's application.
There are plans to generate similar web services in the Java.
|