# Thursday, September 29, 2011

A couple of weeks ago we suggested introducing an enumerator function into XPath (see [F+O30] A enumerator function):

I would like the WG to consider an addition of a function that turns a sequence into an enumeration of values.

Consider a function like this:  fn:enumerator($items as item()*) as function() as item()?;

Alternatively, the signature could be:

 fn:enumerator($items as function() as item()*) as function() as item()?;

This function receives a sequence and returns a function item which, upon its Nth call, returns the Nth element of the original sequence. This way, a sequence of items is turned into a function providing an enumeration of the items of the sequence.

As an example consider two functions:

a) t:rand($seed as xs:double) as xs:double* - a function producing a random number sequence;
b) t:work($input as element()) as element() - a function that generates output from its input, and that needs random numbers in the course of execution.

t:work() may contain code like this:
  let $rand := fn:enumerator(t:rand($seed)),

and later it can call $rand() to get random numbers.

Enumerators will help to compose algorithms in which one algorithm communicates with other, independent algorithms, thus making code simpler. The most obvious class of enumerators is generators: ordered numbers, unique identifiers, random numbers.

Technically, the function returned from fn:enumerator() is nondeterministic, but its "side effect" is similar to the "side effect" of calling generate-id() on a newly created node (see bug #13747 and bug #13494).

The idea is inspired by a generator function, which returns a new value upon each call.
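
For illustration only, here is a rough Java sketch of this idea (the names are ours and purely hypothetical, not part of any spec): a sequence is wrapped into an object that returns the next item upon each call.

import java.util.Iterator;

public class Enumerators
{
  // A function-like object that returns a new value upon each call,
  // or null when the underlying sequence is exhausted.
  public interface Enumerator<T>
  {
    T next();
  }

  // Wraps a sequence into an enumerator over its items.
  public static <T> Enumerator<T> enumerator(final Iterable<T> items)
  {
    final Iterator<T> iterator = items.iterator();

    return new Enumerator<T>()
    {
      public T next()
      {
        return iterator.hasNext() ? iterator.next() : null;
      }
    };
  }
}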

Such a function can be seen as a stateful object, but our goal is to look at it in a more functional way. So, we look at the algorithm as a function producing a sequence of output, which is purely functional, and at an enumerator that allows one to iterate over the algorithm's output.

This way, the function that implements an algorithm and the function that uses it can be seen as two threads of functional programs that use messaging to communicate with each other.

Honestly, we doubt that the WG will accept it, but it's interesting to watch the discussion.

Thursday, September 29, 2011 11:56:05 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Wednesday, September 14, 2011

More than a month has passed since we reported a problem to the Saxon forum (see Saxon optimizer bug and Saxon 9.2 generate-id() bug).

The essence of the problem is that we constructed an argumentless function to return a unique identifier each time the function is called. To achieve the effect we created a temporary node and returned its generate-id() value.

Such a function is nondeterministic, as we cannot state that its result depends on its arguments only. This means that the engine's optimizer is not free to reorder calls to such a function. That's what happens in Saxon 9.2 and Saxon 9.3, where the engine hoists the function call out of a loop, thus producing invalid results.

Michael Kay, the author of the Saxon engine, argued that this is "a gray area of the xslt spec":

If the spec were stricter about defining exactly when you can rely on identity-dependent operations then I would be obliged to follow it, but I think it's probably deliberate that it currently allows implementations some latitude, effectively signalling to users that they should avoid depending on this aspect of the behaviour.

He advised raising a bug in the W3C Bugzilla to resolve the issue. In the end two related bugs were raised:

  • Bug 13494 - Node uniqueness returned from XSLT function;
  • Bug 13747 - [XPath 3.0] Determinism of expressions returning constructed nodes.

Yesterday, the WG has resolved the issue:

The Working Group agreed that default behavior should continue to require these nodes to be constructed with unique IDs. We believe that this is the kind of thing implementations can do with annotations or declaration options, and it would be best to get implementation experience with this before standardizing.

This means that the technique we used to generate unique identifiers is correct and the behaviour is well defined.

The only remaining problem is to wait until Saxon fixes its behaviour accordingly.

Wednesday, September 14, 2011 5:54:56 AM UTC  #    Comments [0] -
Thinking aloud | xslt
# Tuesday, September 6, 2011

We're not big fans of Entity Framework, as we don't expose the database structure directly to the client program but rather through stored procedures and functions. So, EF for us is a tool to expose those stored procedures as .NET wrappers. This limited use of EF still greatly automates the data access code.

But what we have lately found is that EF has a problem with char parameters. Namely, if you import a procedure, say MyProc, that accepts char(1), and then call it through the generated wrapper, you will see in SQL Profiler that the char(1) parameter is passed with many trailing spaces, as if it were char(8000). There is no need to prove that this is highly inefficient.

We can see that the problem happens in the VS 2010 designer rather than in the EF runtime, as the stored procedure's parameters are not attributed with a length; see the model XML (*.edmx):

<Function Name="MyProc" Schema="Data">
  ...
  <Parameter Name="recipientType" Type="char" Mode="In" />
  ...
</Function>

while if we set:

  <Parameter Name="recipientType" Type="char" MaxLength="1" Mode="In" />

the runtime starts working as expected. So the workaround is to fix the model file manually.

See also: Stored Proc and Char parm

Tuesday, September 6, 2011 9:11:38 PM UTC  #    Comments [0] -
.NET | Thinking aloud | Tips and tricks
# Monday, August 29, 2011

Please welcome a new human being Masha Vladimirovna Nesterovsky!

Masha

Monday, August 29, 2011 1:54:22 PM UTC  #    Comments [0] -
Announce
# Sunday, August 28, 2011

AjaxControlToolkit has methods to access ViewState:

protected V GetPropertyValue<V>(string propertyName, V nullValue)
{
  if (this.ViewState[propertyName] == null)
  {
    return nullValue;
  }

  return (V) this.ViewState[propertyName];
}

protected void SetPropertyValue<V>(string propertyName, V value)
{
  this.ViewState[propertyName] = value;
}

...

public bool EnabledOnClient
{
  get { return base.GetPropertyValue("EnabledOnClient", true); }
  set { base.SetPropertyValue("EnabledOnClient", value); }
}

We find that code unnecessarily complex and suboptimal. Our code to access ViewState looks like this:

public bool EnabledOnClient
{
  get { return ViewState["EnabledOnClient"] as bool? ?? true; }
  set { ViewState["EnabledOnClient"] = value; }
}

Sunday, August 28, 2011 7:35:13 PM UTC  #    Comments [0] -
ASP.NET | Tips and tricks

Recently one of the users of the Java yield return annotation has kindly informed us about a problem that happened in his environment (see Java's @Yield return annotation update).

Incidentally, we had never noticed the problem earlier. Along with this issue we have found that the Eclipse compiler has changed in Indigo in a way that forced us to recompile the sources. Well, that's the price you have to pay when you access internal API.

Updated sources can be found at Yield.zip, and compiled jars at Yield.jar (pre-Indigo) and Yield.3.7.jar (Indigo and probably higher).

See also:

Yield return feature in java
Why @Yield iterator should be Closeable
What you can do with jxom.

Sunday, August 28, 2011 7:11:45 PM UTC  #    Comments [0] -
Announce | Java | xslt
# Friday, August 12, 2011

1. query.dll vs tquery.dll

We have installed Windows Search 4 on a Windows 2003 server. The goal was to index huge compressed XML files (see Windows Search Notifications). But for some reason it did not want to index the content.

No "select System.ItemUrl from SystemIndex where contains('...')" has ever returned a row.

We thought that the problem was in our protocol handler and tried to localize it, but finally discovered that Windows Search was not able to find anything within text files.

Registry comparison has shown that the *.txt extension was indexed by the IFilter defined in query.dll, while on the other computers, where everything worked, the implementation was in tquery.dll.

Both libraries were present on the Windows 2003 server, so we corrected the registry and everything started to work.

As far as we understand, query.dll is part of the legacy Indexing Service, while tquery.dll is the up-to-date implementation.

2. Search index size

We have to index a considerable amount of data. But before we can do it we have to estimate the size of the index.

In the past we seem to have seen somewhere a statement that the search index needs storage of about 10% of the original data. Unfortunately we cannot find this estimate at present, nor can we find any other estimate. This complicates our planning.

To get an empirical estimate we've indexed several thousand *.xml-gz files, which are gzipped big XMLs. The total size of these files is about 4.5GB; the total uncompressed size of the XMLs is ~50GB. The XML contained about 10 million pages of data.

According to the 10% criterion we expected to arrive at a ~5GB search index.

But what we have discovered is that the index has grown to more than 50GB. That's very disappointing. We cannot afford such an expense, as we've run the test only on a tiny part of the data, which increases over time.

So, the solution is either to find out what's wrong and how it can be cured, or to full-text index only the most recent subset of the data.

P.S. We have tried to mark the folder with the search index as compressed, but it did not work.

P.P.S. We have found a reference to the Windows Search 4 index size estimation. It is in Windows Search Frequently Asked Questions; see the answer to the question "What is average size of a user's index?".

Friday, August 12, 2011 9:20:18 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Monday, August 1, 2011

Yesterday (2011-07-31) we finished the project (development and support) of modernization of a Cool:GEN code base to Java for the Chicago Mercantile Exchange.

It wasn't the first such project but definitely the most interesting one. We have migrated and tested about 300 MB of source code. In the process of translation we identified many bugs that were present in the original code. Thanks to languages-xom, the task turned out to be pure XSLT.

We hope that CME's developers are pleased with the results.

If you, by chance, are looking for Cool:GEN conversion to Java, C#, or even COBOL (we don't understand why people still ask for COBOL), then you can start at the bphx site.

Monday, August 1, 2011 6:32:15 AM UTC  #    Comments [0] -

# Wednesday, July 27, 2011

XSLT code that worked in production for several years failed unexpectedly. That's unusual and unfortunate, but it happens.

We started to analyze the problem, narrowed down the code block and recreated it in a simple form. Here it is:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:t="http://www.nesterovsky-bros.com/xslt/public"
  exclude-result-prefixes="t xs">

<xsl:template match="/" name="main">
  <xsl:variable name="content">
    <root>
      <xsl:for-each select="1 to 3">
        <item/>
      </xsl:for-each>
    </root>
  </xsl:variable>

  <xsl:variable name="result">
    <root>
      <xsl:for-each select="$content/root/item">
        <section-ref name-ref="{t:generate-id()}.s"/>
        <!--
        <xsl:variable name="id" as="xs:string"
          select="t:generate-id()"/>
        <section-ref name-ref="{$id}.s"/>
        -->
      </xsl:for-each>
    </root>
  </xsl:variable>

  <xsl:message select="$result"/>
</xsl:template>

<xsl:function name="t:generate-id" as="xs:string">
  <xsl:variable name="element" as="element()">
    <element/>
  </xsl:variable>

  <xsl:sequence select="generate-id($element)"/>
</xsl:function>

</xsl:stylesheet>

This code performs some transformation and assigns unique values to name-ref attributes. Values generated with the t:generate-id() function are guaranteed to be unique, as the spec claims that every node has its own unique generate-id() value.

Imagine our surprise to find that the generated elements all have the same name-ref values. We studied the code all over and found no holes in our reasoning and implementation, so our conclusion was: it's a Saxon bug!

Interestingly enough, if we rewrite the code a little (see the commented part), it starts to work properly; thus we suspect Saxon's optimizer.

Well, in the course of development we have found and reported many Saxon bugs, but how come this little beetle was hiding for so long?

We've verified that the bug exists in versions 9.2 and 9.3. Here is the bug report: Saxon 9.2 generate-id() bug.

Unfortunately, it has been there for three days already (2011-07-25 to 2011-07-27) without any reaction. We hope this will change soon.

Wednesday, July 27, 2011 8:02:38 PM UTC  #    Comments [0] -
Tips and tricks | xslt
# Friday, July 22, 2011

We needed to track the stream position during creation of an XML file. This is to allow random access to a huge XML file (the task is related to Windows Search).

This is a simplified form of the XML:

<data>
  <item>...</item>
   ...
  <item>...</item> 
</data>

The goal was to have the stream position of each item element. With this in mind, we've decided to:

  • open a stream, and then an XML writer over it;
  • write data into the XML writer;
  • call the Flush() method of the XML writer before measuring the stream offset.

Here is a code sample:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

for(var i = 0; i < 10; ++i)
{
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);
 
  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

That's the output:

Flush offset: 46, char: a
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

Funny, isn't it?

After feeding the start tag <data> and flushing the XML writer, we observe that only "<data" has been written down to the stream. Well, Flush() has never promised anything particular about the content of the stream, so we cannot claim any violation; however, we expected to see the whole start tag.

Inspection of the implementation of the XML writer reveals laziness in writing data down to the stream. In particular, the start tag is closed only when one starts writing the content. This is probably to support empty tags: <data/>.

To do the trick we had to issue empty content; moreover, we had to call a particular method of the XML writer with particular parameters. So the code after the fix looks like this:

var stream = new MemoryStream();
var writer = XmlWriter.Create(stream);

writer.WriteStartDocument();
writer.WriteStartElement("data");

char[] empty = { ' ' };

for(var i = 0; i < 10; ++i)
{
  writer.WriteChars(empty, 0, 0);
  writer.Flush();

  Console.WriteLine("Flush offset: {0}, char: {1}",
    stream.Position,
    (char)stream.GetBuffer()[stream.Position - 1]);

  writer.WriteStartElement("item");
  writer.WriteValue("item " + i);
  writer.WriteEndElement();
}

writer.WriteEndElement();
writer.WriteEndDocument();

And the output is:

Flush offset: 47, char: >
Flush offset: 66, char: >
Flush offset: 85, char: >
Flush offset: 104, char: >
Flush offset: 123, char: >
Flush offset: 142, char: >
Flush offset: 161, char: >
Flush offset: 180, char: >
Flush offset: 199, char: >
Flush offset: 218, char: >

While this code works, we feel uneasy with it.

What's a better way to solve the task?

Update: further analysis shows that this is the only possible behaviour, as after the call to write a start element you can write either attributes, content or the end of the element, so the writer may write either a space, '>' or '/>'. The only question is why it takes WriteChars(empty, 0, 0) into account but not WriteValue("").

Friday, July 22, 2011 9:08:36 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Wednesday, July 13, 2011

As you probably know, we have implemented our custom Protocol Handler for Windows Search.

It's called .xml-gz, and its goal is to index compressed XML files and to have search results with subtree precision. So, for the XML:

<data>
  <item>...</item>
  <item>...</item>
  ...
</data>

the search finds results within item elements and returns the XML's URL and the stream offset of the item. Using the ZLIB API we can compress data with stream bookmarks, so fast random access to the data is possible.

The only problem we have is with notification of changes (create, delete, update) to such files.

The spec describes several techniques (none of them has worked for us):

1. Call catalogManager.ReindexMatchingURLs() - it just returns without any impact.

2. Call changeSink.OnItemsChanged() - it returns an error.

3. Implement an .xml-gz IFilter and call IGatherNotifyInline (see "have your .zip urls indexed when they are created or modified") - that's a mystery to us.

4. Implement a root URL in the form .xml-gz:/// and perform a Windows Search query:

SELECT
  System.ItemUrl, System.DateModified
FROM
  SystemIndex WHERE System.FileExtension='.xml-gz'

to find all .xml-gz sources. This is not reliable, as your protocol handler can be (and is) called before the file is indexed.

So, the only reliable way to index your data is to (re-)add the indexing rule for the protocol handler, which in most cases reindexes everything.

The only bearable solution we found is to define an indexing rule in the form .xml-gz://file:d:/data/... and to use the IShellFolder(2) interfaces to discover sub-items and their modification times. This technique allows a minimal data scan when you (re-)add the indexing rule.

Wednesday, July 13, 2011 8:21:00 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Saturday, July 9, 2011

Being inexperienced with Windows Search, we tried to build queries to find data in a huge storage. We needed to find a document that matches some name pattern and contains some text.

Our naive query was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName like '...%' and
  contains('...')

In most cases this query returns nothing and runs very long. It's interesting to note that it may start returning data if the "top" clause is missing or uses a bigger number, but in those cases the query is even slower.

The next try was like this:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemName >= '...' and System.ItemName < '...' and
  contains('...')

This query is also slow, but at least it returns some results.

At some point we started to question the utility of Windows Search if it's so slow, but then we found that there is a property System.ItemNameDisplay, which in our case coincides with the value of the property System.ItemName, so we tried the query:

select top 1000
  System.ItemUrl
from
  SystemIndex
where
  scope = '...' and
  System.ItemNameDisplay like '...%' and
  contains('...')

This query worked fast and produced good results. This hints that the search engine has an index on System.ItemNameDisplay, in contrast to the System.ItemName property.

We've looked at property definitions:

System.ItemNameDisplay

The display name in "most complete" form. It is the unique representation of the item name most appropriate for end users.

propertyDescription
    name = System.ItemNameDisplay
    shellPKey = PKEY_ItemNameDisplay
    formatID = B725F130-47EF-101A-A5F1-02608C9EEBAC
    propID = 10
    searchInfo
       inInvertedIndex = true
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

System.ItemName

The base name of the System.ItemNameDisplay property.

propertyDescription
    name = System.ItemName
    shellPKey = PKEY_ItemName
    formatID = 6B8DA074-3B5C-43BC-886F-0A2CDCE00B6F
    propID = 100
    searchInfo
       inInvertedIndex = false
       isColumn = true
       isColumnSparse = false
       columnIndexType = OnDisk
       maxSize = 128

Indeed, one property is indexed, while the other is not.

As with other databases, a query is fast when the engine uses indices rather than performs a data scan. This is also true for Windows Search.

The differences in results that variations of the query produce also show that Windows Search is nevertheless very different from a relational database.

Saturday, July 9, 2011 10:01:36 AM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Tuesday, July 5, 2011

We have developed our custom Windows Search Protocol Handler. The role of this component is to expose items of complex content (or unusual storage) to Windows Search.

You can think of it as a virtual folder: a Protocol Handler allows enumerating its files, file properties, and contents.

The goal of our Protocol Handler is to represent some data structure as a set of XML files. We expected that if we could find data within a folder containing these files, then a search within the Protocol Handler's scope would bring the same (or almost the same) results.

Reality is different.

For some reason the .xml IFilter (a component that extracts text data to index) works differently with the file system and with our storage. We cannot state that it does not work, but for some reason many words that Windows Search finds within a file are never found within the Protocol Handler's scope.

We have observed that if, for the purpose of indexing, we represent content XML items as .txt files, then the search works as expected. So, our workaround was to present only the XML's text data for indexing and to use the .txt IFilter (this is in fact roughly what the .xml IFilter does by itself).

Is there a conclusion?

Well, Windows Search is a black box probably containing bugs. Its behaviour is not always obvious.

Tuesday, July 5, 2011 8:31:47 PM UTC  #    Comments [0] -
Thinking aloud | Tips and tricks | Window Search
# Friday, June 24, 2011

Let's put it bluntly: Windows Search 4 has design and implementation problems.

You discover this immediately when you start implementing indexing of a custom file format.

If you want to index a simple file format, then you need to implement the IFilter interface. But if it happens that you want to index compound data, then you have to invent your own protocol.

If you figure out how to implement your protocol to index that compound data, then you will most probably get stuck on the problem of how to notify the indexer about the changes.

The problem is that Windows Search 4 has an API to reindex URLs, which simply does not work, and an API to notify the indexer about changes, which throws an error (returns an error HRESULT) for custom protocols. At least, we were not able to make it work.

Friday, June 24, 2011 7:39:36 PM UTC  #    Comments [0] -
Thinking aloud | Window Search
# Thursday, June 16, 2011

There is a problem with XML serialization of BigDecimal values, as we've written in one of our previous articles, "BigDecimal + JAXB => potential interoperability problems". And now we have run into an issue with serialization of double / Double values. All such values, except zero, are serialized in scientific format, even when a value contains only an integer part. For example, 12 will be serialized as 1.2E+1. Actually, this does not contradict the XML Schema definitions.

But what can be done if you want to send/receive double and/or decimal values in plain format? For example, you want to serialize a double / BigDecimal value 314.15926 in XML as is. In this case you ought to use javax.xml.bind.annotation.adapters.XmlAdapter.

In order to solve this task we've created two descendants of XmlAdapter (the first for double / Double and the second for BigDecimal); click here to download the sources.
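
For illustration, a minimal sketch of such an adapter for double / Double might look like this (the class name is ours; the actual downloadable sources may differ):

import java.math.BigDecimal;
import javax.xml.bind.annotation.adapters.XmlAdapter;

// Serializes double / Double values in plain (non-scientific) notation.
public class PlainDoubleAdapter extends XmlAdapter<String, Double>
{
  @Override
  public String marshal(Double value) throws Exception
  {
    return value == null ? null :
      new BigDecimal(value.toString()).toPlainString();
  }

  @Override
  public Double unmarshal(String value) throws Exception
  {
    return value == null ? null : Double.valueOf(value);
  }
}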

Applying these classes at the property or package level, you may manage XML serialization of numeric fields in your classes.

See this article for tips on how to use custom XML serialization.

Thursday, June 16, 2011 10:14:36 PM UTC  #    Comments [0] -
Java | Tips and tricks