# Friday, July 26, 2019

We were asked to help with a search service at an enterprise. We were told that their SharePoint portal did not serve their needs. The main complaints were about the quality of search results.

They had decided to implement an external index of SharePoint content, using Elastic, and to expose a custom search API within the enterprise.

We questioned their conclusions: we asked why they thought Elastic would give much better results, and whether they had tried to figure out why SharePoint did not give the desired results.

The answers did not convince us, though we joined the project.

What do you think? Elastic did not help at all, though they had hoped very much that its query language would help to rank results in such a way that matching documents would be found. After all, they thought it was a problem of result ranking.

Here we started our analysis. We took a specific document that had to be found but was never returned from search.

It turned out to be a well-known problem; at least we had dealt with a closely related one in the past. There are two ingredients here:

  • documents that have low chances of being found are PDFs;
  • we live in Israel, so most texts are in Hebrew, which means words are written from right to left, while some other texts are written from left to right. See Bi-directional text.

Traditionally, PDF documents are produced in a way that only distantly resembles the logical structure of the original content. E.g., paragraphs of text are often represented as unrelated runs of text lines, or as sets of text runs representing single words or independent characters. Needless to say, an additional complication comes from the fact that Hebrew texts are often stored visually (from left to right, as if "hello" were stored as "olleh" and just printed from right to left). Another common feature of PDFs is custom fonts with non-canonical mappings, or images with glyphs of letters.
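To make the visual-order problem concrete, here is a minimal C++ sketch (the project described in this post is Java; C++ is used only for illustration). It assumes a run that consists purely of right-to-left characters. The helpers visual_to_logical and contains are hypothetical names, and real documents with mixed directions need the full Unicode bidirectional algorithm rather than a plain reversal.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: restores logical order for a run stored in visual order.
// Assumption: the whole run is right-to-left text with no embedded
// left-to-right fragments (numbers, Latin words).
std::u32string visual_to_logical(std::u32string run)
{
  std::reverse(run.begin(), run.end());

  return run;
}

// A naive substring search, standing in for a search index lookup.
bool contains(const std::u32string& text, const std::u32string& query)
{
  return text.find(query) != std::u32string::npos;
}
```

With the text stored as "olleh", a query for "hello" finds nothing until the run is put back into logical order, which is exactly what the crawler cannot do on its own.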

You can implement these tricks in other document formats, but for some reason PDF is the only format we have seen that uses these techniques regularly and intensively.

At this point we realized that it is not the search engine's fault that the document is not found; rather, it is a feature of the PDF that it exposes its text to a crawler in a form that cannot be used for search. In fact, a PDF viewer itself cannot search in such documents: when you try to find some text that you clearly see in a document opened in a PDF viewer, you often find nothing.

A question: what should you do when no PDF text extractor can give you the correct text, but the text is there when you look at the document in a PDF viewer?

We decided it was time to go in the direction of image recognition. Thankfully, nowadays it's a matter of available processing resources.

Our goal was:

  1. Have images of each PDF page. This task is immediately solved with Apache PDFBox (A Java PDF Library); it's time to say that this is a Java project.
  2. Run Optical Character Recognition (OCR) over the images, and get the extracted texts. This is perfectly done by tesseract-ocr/tesseract, and thanks to its Java wrapper bytedeco/javacpp-presets we can call this C++ API from Java directly.

The only small nuisance of tesseract is that it does not expose table recognition info, but we can easily overcome this (we solved this task in the past), as tesseract exposes the position of each text run along with its text.
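One possible way to recover table rows from positioned text runs is sketched below in C++. This is not the code from our project and not tesseract's API: the text_run structure, the group_into_rows function and the pixel tolerance are assumptions made for illustration. Runs with close vertical positions are treated as one row, and each row is then ordered by horizontal position.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// A recognized text run with the top-left corner of its bounding box.
// The field names are illustrative, not tesseract's API.
struct text_run
{
  int x;
  int y;
  std::string text;
};

using row = std::vector<text_run>;

// Groups runs into rows: a run joins the current row when its y differs
// from the y of the row's first run by at most `tolerance` pixels.
// Each resulting row is ordered by x.
std::vector<row> group_into_rows(std::vector<text_run> runs, int tolerance)
{
  std::sort(runs.begin(), runs.end(),
    [](const text_run& a, const text_run& b) { return a.y < b.y; });

  std::vector<row> rows;

  for(auto& run: runs)
  {
    if (rows.empty() || std::abs(run.y - rows.back().front().y) > tolerance)
    {
      rows.emplace_back();
    }

    rows.back().push_back(std::move(run));
  }

  for(auto& r: rows)
  {
    std::sort(r.begin(), r.end(),
      [](const text_run& a, const text_run& b) { return a.x < b.x; });
  }

  return rows;
}
```

Columns can be derived in a similar way by clustering the x positions; for right-to-left pages the order within a row would have to be reversed.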

What are the results of running such a program?

  1. Full success! It works with high quality of recognition. Indeed, there is no physical noise to impact the quality.
  2. Slow speed: up to several seconds of recognition per page.
  3. A scalable solution. The slow speed can be compensated by almost unlimited theoretical scalability.

So, what is the lesson we have taken from this experience?

Well, you should question yourself, and test and verify ideas on the ground before building any theories that will lead you in a completely wrong direction. In the end people started to realize there was no need to blame SharePoint, to throw it away, and to spend a great deal of time and money just to prove that the problem was in a different place.

Sample source code can be found at App.java

Friday, July 26, 2019 4:38:11 PM UTC
C++ | Java | Thinking aloud | Tips and tricks
# Monday, August 6, 2018

A discussion about exceptions has been going on in the C++ community for more than 25 years. In our opinion this can only be compared with the math community and its open problems, like Hilbert's 23 problems dating from 1900.

In essence, the C++ exception discussion is about the efficiency of exceptions vs status codes. This problem is not so acute in other languages (like Java or C#) because those languages pursue different goals.

C++ designers have introduced a zero-overhead principle for any language feature, which is:

  1. If you don’t use some feature you don’t pay for it.
  2. If you do use it you cannot (reasonably) write it more efficiently by hand.

Exceptions, compared to status codes, do not meet this demand. This has led to fragmentation of the C++ community, where many big projects and code styles ban exceptions partially or completely.

Make no mistake: all this time people have been trying to make exceptions better, and have found techniques to make them space and run-time efficient to some extent, but still plain old status codes outperform them both in speed (especially in predictability of the timing of exception handling logic) and in code size.

We guess the solution has finally been found after a quarter of a century of discussion!

WG paper: Zero-overhead deterministic exceptions: Throwing values by Herb Sutter. This "paper aims to extend C++’s exception model to let functions declare that they throw a statically specified type by value. This lets the exception handling implementation be exactly as efficient and deterministic as a local return by value, with zero dynamic or non-local overheads."

In other words, the author suggests to:

  • extend the exception model (in fact, implement an additional one);
  • make exceptions as fast and as predictable as status codes (which virtually means designating a status code as a new exception).

 

Here are the author's arguments:

  1. A status code is usually just a number, and handling an error amounts to performing some if or switch statements.
  2. Handling errors with status codes is predictable in time and visible in the code, though it burdens the logic very much.
  3. Exceptions give a clear separation between control flow and error handling.
  4. Exceptions are mostly invisible in the code, so it's not always clear how much they add to code size and how they impact performance.
  5. Statistics show that exceptions add 15 to 30% to the size of code, even if no exceptions are used (a violation of the zero-overhead principle).
  6. Exceptions require Run Time Type Information about types, and have no predictable memory (stack or heap) footprint (a violation of the zero-overhead principle).

 

What the author suggests might seem too radical, but at present it's the only viable solution to reestablish the zero-overhead principle and to reunite the two C++ camps.

Here is the approach.

  1. Clarify what should be considered as an exception.
    1. Contract violation.

      Are contract violations, like invalid argument values or broken postconditions (invariants that do not hold), exceptions or programmer's bugs?

      If the latter, then it's best to terminate, as you cannot correctly recover from a bug.

    2. Virtual Machine fault.

      What can a user program do with a stack overflow?

      The best, according to the author, is to terminate.

    3. OOM - Out Of Memory error.

      What is the best way to deal with OOM during dynamic allocation?

      At present there are two operators:

      • new - to allocate memory dynamically and throw bad_alloc on failure.
      • new(nothrow) - to allocate memory dynamically and return nullptr on failure.

      Herb Sutter suggests changing the behavior of new to terminate on failure (it is very hard to handle bad_alloc properly anyway), while new(nothrow) will still allow building code that can deal with OOM.

    4. Partial success

      This should never be reported as an error, and status codes should be used to report the state.

    5. Error condition distinct from any type of success.

      This is where exceptions should be used.

    Statistics show that with such a separation more than 90% of what currently is an exception will not be an exception any more, so no hidden exception logic is required: the program either works or terminates.

  2. Refactor exception

    Redefine what an exception object is and how it is propagated.

    It should be a thin value type. At minimum it needs to contain an error code. The suggested size is up to a couple of pointers.

    The compiler should be able to cheaply allocate and copy it on the stack or even in the processor's registers.

    Such a simple exception type resolves problems with the lifetime of the exception object, and makes exception handling as light as checking status codes.

    An exception should be propagated through the return channel, so it's like a new calling convention in which a function yields either a result or an error outcome.
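The idea of a thin error value travelling through the return channel can be emulated by hand in today's C++. The sketch below is only our illustration under stated assumptions: error_code, result and make_buffer are hypothetical names that do not come from the paper, and the paper proposes language support rather than a library type like this. The sketch also shows new(nothrow), which reports an allocation failure as nullptr instead of throwing bad_alloc.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <utility>

// A thin error value: a plain code that fits into a register.
// The names here are illustrative and do not come from the paper.
enum class error_code : int
{
  none = 0,
  out_of_memory,
  invalid_size
};

// Either a value or an error, returned through the ordinary return channel.
template<typename T>
struct result
{
  T value;
  error_code error;

  // True when the call succeeded.
  explicit operator bool() const { return error == error_code::none; }
};

// Allocates a zero-initialized buffer and reports failure as a value
// instead of throwing.
result<std::unique_ptr<int[]>> make_buffer(std::size_t size)
{
  if (size == 0)
  {
    return { nullptr, error_code::invalid_size };
  }

  // new(nothrow) returns nullptr on failure instead of throwing bad_alloc.
  std::unique_ptr<int[]> data(new (std::nothrow) int[size]());

  if (!data)
  {
    return { nullptr, error_code::out_of_memory };
  }

  return { std::move(data), error_code::none };
}
```

A caller checks the outcome with a plain if, e.g. auto buffer = make_buffer(16); if (!buffer) { /* inspect buffer.error */ }. This is as predictable as a status code, but it shows the burden on the logic that the proposal wants the compiler to take over.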

It's not our intention to quote the whole paper here. If you're interested, please read it. Here we want to outline our concerns.

  1. Exception payload.

    The paper emphasizes that the exception type should be small.

    So, what to do with the exception payload, if any (at least an error message if it's not a static text)?

    If this is not resolved, developers will start to invent custom mechanisms like GetLastErrorMessage().

    And what to do with aggregate exceptions?

    We think this must be clearly addressed.

  2. Implementation shift.

    We can accept that most of the current exceptions will terminate.

    But consider now some container that serves requests, like a web container or a database.

    It may be built from multiple components and serve multiple requests concurrently. If one request terminates we don't want the container to terminate.

    If the terminate handler is called then we cannot rely on the state of the application. At the least we can expect heap leaks and unreleased resources.

    So, we either want to be able to release heap and other resources per request, or we want to go down with the whole process and let the OS deal with it.

    In the latter case we need to design such containers differently: as a set of cooperating processes; the OS should allow spinning up processes more easily.

  3. VM with exceptions

    There are Virtual Machines that allow an exception to be thrown on any instruction (like JVM or CLI).

    In this case you cannot say that code will never throw an exception, as it can out of the blue!

    Even on x86 you can get a page fault on memory access, which can be translated into an exception.

    So, it's still a question whether the terminate() solution is sound in every case, and whether the compiler can optimize out exception handling if it proves statically that no exception can be thrown.

Monday, August 6, 2018 12:00:10 PM UTC
C++ | Thinking aloud
# Thursday, October 6, 2016

Our genuine love is C++. Unfortunately clients don't always share our preferences, so we are mostly occupied with C#, Java and JavaScript. Nevertheless, we're closely watching the evolution of C++. It has become more mature in the latest specs.

Recently, we wondered how we would deal with dependency injection in C++. What we found only strengthened our commitment to C++.

Parameter packs introduced in C++11 allow a trivial implementation of constructor injection, while std::type_index, std::type_info and std::any give service containers.
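To show how these pieces might fit together, here is a minimal sketch of a service container. It is not Boost.DI and not production code: the container class, its add and get members, and the model, view and controller types are our own illustrative names. It assumes C++17 for std::any, shares every service as a singleton, and requires dependencies to be listed explicitly at registration.

```cpp
#include <any>
#include <functional>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <utility>

// A minimal service container sketch (C++17). All names are illustrative.
class container
{
public:
  // Registers T; Deps lists the services passed to T's constructor.
  template<typename T, typename... Deps>
  void add()
  {
    factories_[std::type_index(typeid(T))] = [this]() -> std::any
    {
      // The parameter pack expands into one get() call per constructor argument.
      return std::make_shared<T>(get<Deps>()...);
    };
  }

  // Returns the shared instance of T, creating it on first use.
  // Throws std::out_of_range if T was not registered.
  template<typename T>
  std::shared_ptr<T> get()
  {
    auto key = std::type_index(typeid(T));
    auto found = instances_.find(key);

    if (found == instances_.end())
    {
      found = instances_.emplace(key, factories_.at(key)()).first;
    }

    return std::any_cast<std::shared_ptr<T>>(found->second);
  }

private:
  std::unordered_map<std::type_index, std::function<std::any()>> factories_;
  std::unordered_map<std::type_index, std::any> instances_;
};

struct model {};

struct view {};

struct controller
{
  controller(std::shared_ptr<model> m, std::shared_ptr<view> v):
    model_(std::move(m)), view_(std::move(v))
  {
  }

  std::shared_ptr<model> model_;
  std::shared_ptr<view> view_;
};
```

After c.add<model>(), c.add<view>() and c.add<controller, model, view>(), a call to c.get<controller>() builds the controller with both dependencies resolved from the container. Note that this sketch looks services up at run time through std::type_index, so it does not have the compile-time resolution that libraries like Boost.DI aim for.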

In fact there are many DI implementations out there. The one we refer to here is Boost.DI. It's not standard, nor can we claim it's the best, but it's a good example of how this concept can be implemented.

So, consider their example as seen in Java with CDI, in C# with .NET Core injection, and in C++:

Java:

@Dependent
public class Renderer 
{
  @Inject @Device
  private int device;
};

@Dependent
public class View 
{
  @Inject @Title
  private String title;
  @Inject
  private Renderer renderer;
};

@Dependent
public class Model {};

@Dependent
public class Controller 
{
  @Inject
  private Model model;
  @Inject
  private View view;
};

@Dependent
public class User {};

@Dependent
public class App 
{
  @Inject
  private Controller controller;
  @Inject
  private User user;
};

...
  Provider<App> provider = ...

  App app = provider.get();

C#:

using Microsoft.Extensions.Options;

public class RendererOptions
{
  public int Device { get; set; }
}
    
public class ViewOptions
{
  public string Title { get; set; }
}
    
public class Renderer 
{
  public Renderer(IOptions<RendererOptions> options)
  {
    Device = options.Value.Device;
  }

  public int Device { get; set; }
}

public class View 
{
  public View(IOptions<ViewOptions> options, Renderer renderer)
  {
    Title = options.Value.Title;
    Renderer = renderer;
  }

  public string Title { get; set; }
  public Renderer Renderer { get; set; }
}

public class Model {}

public class Controller 
{
  public Controller(Model model, View view) 
  {
    Model = model;
    View = view;
  }

  public Model Model { get; set; }
  public View View { get; set; }
};

public class User {};

public class App 
{
  public App(Controller controller, User user) 
  {
    Controller = controller;
    User = user;
  }

  public Controller Controller { get; set; }
  public User User { get; set; }
};

...
  IServiceProvider serviceProvider = ...

  serviceProvider.GetService<App>();

C++:

#include <string>

#include <boost/di.hpp>

namespace di = boost::di;

struct renderer 
{
  int device;
};

class view 
{
public:
  view(std::string title, const renderer&) {}
};

class model {};

class controller 
{
public:
  controller(model&, view&) {}
};

class user {};

class app 
{
public:
  app(controller&, user&) {}
};

int main()
{
  /**
   * renderer renderer_;
   * view view_{"", renderer_};
   * model model_;
   * controller controller_{model_, view_};
   * user user_;
   * app app_{controller_, user_};
   */

  auto injector = di::make_injector();
  injector.create<app>();
}

What is the difference between these DI flavors?

Not too much from the perspective of the final task achieved.

In Java we used member injection, with qualifiers to inject scalars.

In C# we used constructor injection with the Options pattern to inject scalars.

In C++ we used constructor injection with constants injected directly.

All these technologies have their own API to initialize the DI container, but, again, while the API is different, the idea is the same.

So, the expressiveness of C++ matches that of Java and C#.

Deeper analysis shows that Java's CDI is more feature-rich than the DI of C# and C++, but, personally, we consider it an advantage of C# and C++ that they have such a light DI.

At the same time there is an important difference between C++ on the one hand and Java and C# on the other.

While both Java and C# are bound to use reflection (C# in theory could use code generation on the fly to avoid reflection), C++'s DI constructs and injects services natively.

What does it mean for the user?

Well, a lot! Both in Java and in C# you would not want to use DI in a performance-critical part of code (e.g. in a tight loop), while it's OK in C++ due to the near-zero performance impact of DI. This may result in more modular and performant code in C++.

Thursday, October 6, 2016 11:27:42 AM UTC
.NET | C++ | Java | Thinking aloud
# Sunday, August 17, 2014

Among the latest C++ proposals the most ambiguous is N4021.

The goal of that proposal is "to define a 2D drawing API for the C++ programming language".

The motivation goes like this:

Today, computer graphics are pervasive in modern life, and are even replacing console-style I/O for basic user interaction on many platforms. For example, a simple cout << "Hello, world!" statement doesn’t do anything useful on many tablets and smartphones. We feel that C++ programmers should have a simple, standard way of displaying 2D graphics to users.

The authors compare several public and proprietary APIs and select the one named the cairo graphics library as a base.

Reflecting on the starting point, they write:

Taken as a whole, starting from cairo allows for the creation of a 2D C++ drawing library that is already known to be portable, implementable, and useful without the need to spend years drafting, implementing, and testing a library to make sure that it meets those criteria.
...
An alternative design would be to create a new API via a synthesis of existing 2D APIs. This has the benefit of being able to avoid any perceived design flaws that existing APIs suffer from. Unfortunately this would not have implementation and usage experience. Further, doing so would not provide any guarantee that design flaws would not creep in.

What follows is a discussion of the best way to transform that C library into a std-style C++ API.

 

Our thoughts on this proposal are threefold:

  1. This proposal seems a decade or two late.
  2. The C++ standard should be modular to support basic and optional features.
  3. We feel that programmers will not be satisfied with bare 2D graphics. It's not enough nowadays.

 

Indeed, appeals to create a standard C++ API for UI are as old as C++'s standardization process. It's clear why the committee has not produced such an API yet: it is a bureaucracy that can only approve APIs. In fact it's the role of the community to invent and implement libraries that may make their way into the standard. Without consensus in the community no standard will reflect such an API.

On the other hand, the C++ spec at present is too fat. Probably not many people are satisfied with the pace of its evolution. Any big chunk of new API makes the progress even slower. The C++ spec should go through a refactoring, be split into core(s) and libraries, and allow individual progress of each part. This would simplify both specification and implementation. After that refactoring an API could be added or deprecated much more easily. In fact implementations were always like this. It's the spec that tries to be a monolith.

As for a new 2D graphics API: it looks like an idea from the late 90s. We think that today's programmers (at least several samples :-) ) wish to deal with an industry-standard UI API, and not to start from basic drawing. Looking around we observe that HTML5 is such a de facto standard. Take into account that it supports rich layout, SVG, canvas and user input; in addition it's good for GPU optimization. Even if you want to deal with simple graphics, you can build SVG markup or draw on the canvas.

So, what we would rather see in the C++ spec is an HTML binding API (both for DOM and JavaScript).

Just think of a standard C++ program that uses an HTML engine as its UI!

Sunday, August 17, 2014 8:56:08 AM UTC
C++ | Thinking aloud
# Friday, March 16, 2012

After the C++11 revision was approved, a new cycle of C++ design began:

N3370: The C++ standards committee is soliciting proposals for additional library components. Such proposals can range from small (addition of a single signature to an existing library) to large (something bigger than any current standard library component).

At this stage it's interesting to read the papers, as authors try to express ideas rather than to formulate sentences that should go into the spec, as was lately the case.

These are just several papers that we've found interesting:

N3322 12-0012 A Preliminary Proposal for a Static if Walter E. Brown
N3329 12-0019 Proposal: static if declaration H. Sutter, W. Bright, A. Alexandrescu

Those proposals argue for a compile-time "if statement". The feature can replace the #if preprocessor directive, SFINAE, or in some cases template specializations.

A static if declaration can appear wherever a declaration or a statement is legal. The authors also propose to add a static if clause to class and function declarations to conditionally exclude them from the scope.

Examples:

// Compile time factorial.
template <unsigned n>
struct factorial
{
  static if (n <= 1)
  {
    enum : unsigned { value = 1 };
  }
  else
  {
    enum : unsigned { value = factorial<n - 1>::value * n };
  }
};

// Declare class provided a condition is true.
class Internals if (sizeof(void*) == sizeof(int));

The paper presents a strong rationale for why this addition helps to build better programs; however, questions arise about the relation between static if and concepts, and between the static if clause and error diagnostics.
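For comparison, static if as proposed did not become part of the language, but C++17 later standardized a narrower construct, if constexpr, which works only inside function bodies. The factorial example can be restated with it as a sketch; the function form of factorial here is our own adaptation, not something taken from the papers.

```cpp
// Compile time factorial with C++17 `if constexpr`.
template<unsigned n>
constexpr unsigned factorial()
{
  if constexpr (n <= 1)
  {
    return 1;
  }
  else
  {
    // This branch is discarded, and not instantiated, when n <= 1,
    // so the recursion stops without a template specialization.
    return factorial<n - 1>() * n;
  }
}

static_assert(factorial<5>() == 120, "evaluated at compile time");
```

Unlike the proposed static if, if constexpr cannot appear at class or namespace scope, so conditional member declarations, such as the enum in the example above, still need specializations or other techniques.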

 

N3327 12-0017 A Standard Programmatic Interface for Asynchronous Operations N. Gustafsson, A. Laksberg
N3328 12-0018 Resumable Functions Niklas Gustafsson

That's our favorite.

The authors propose an API and language extensions to make asynchronous programs simpler.

In fact, an asynchronous function will look very much like a regular one but with small additions. It's similar to yield return in C# (a construct that has been available in C# for many years and is well vetted), and to the async/await feature of C# 5 (.NET 4.5). The compiler will rewrite such a function into a state machine, so the function can suspend its execution, wait for the data, and resume when the data is available.

Example:

// read data asynchronously from an input and write it into an output.
int cnt = 0;

do
{
  cnt = await streamR.read(512, buf);

  if (cnt == 0)
  {
    break;
  }

  cnt = await streamW.write(cnt, buf);
}
while(cnt > 0);

It's interesting to see how the authors will address yield return: either with an additional keyword, or in terms of resumable functions.
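To give a feeling for the rewrite into a state machine, here is a hand-written C++ sketch of roughly what a compiler might produce for the copy loop above. Everything in it is an assumption made for illustration: the copy_task class and its resume member are hypothetical, plain strings stand in for the streams, and each await from the example becomes a point where resume() returns to the caller.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// A hand-written state machine for the copy loop. All names are hypothetical;
// the "streams" are plain strings to keep the sketch self-contained.
class copy_task
{
public:
  copy_task(std::string input, std::string& output, std::size_t chunk):
    input_(std::move(input)), output_(output), chunk_(chunk)
  {
  }

  // Runs until the next suspension point; returns false once finished.
  bool resume()
  {
    switch(state_)
    {
      case state::read:
      {
        // Corresponds to: cnt = await streamR.read(...)
        buffer_ = input_.substr(position_, chunk_);
        position_ += buffer_.size();
        state_ = buffer_.empty() ? state::done : state::write;

        return state_ != state::done;
      }
      case state::write:
      {
        // Corresponds to: cnt = await streamW.write(...)
        output_ += buffer_;
        state_ = state::read;

        return true;
      }
      default:
      {
        return false;
      }
    }
  }

private:
  enum class state { read, write, done };

  std::string input_;
  std::string& output_;
  std::size_t chunk_;
  std::size_t position_ = 0;
  std::string buffer_;
  state state_ = state::read;
};
```

The local variables of the original function (here the buffer and the position) become members that survive between calls, and a scheduler would call resume() whenever the awaited data is ready. With the proposed language support this boilerplate is exactly what the programmer no longer writes.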

 

N3340 12-0030 Rich Pointers D. M. Berris, M. Austern, L. Crowl

Here the authors try to justify rich type-info but mask it under the name "rich pointers". To make things even more obscure they argue about dynamic code generation.

If you want a rich type-info then you should talk about it and not about a thousand other things.

We would rather appeal for a standard API to access the post-compile object model, which could be used to produce different type-infos or other source derivatives.

This paper is our outsider. :-)

 

N3341 12-0031 Transactional Language Constructs for C++ M. Wong, H. Boehm, J. Gottschlich, T. Shpeisman, et al.

Here people try to generalize locking (to put you away from it), and to replace it with another word: "transaction".

It does not seem a viable proposition. It's better to teach the functional style of programming with its immutable objects.

 

N3347 12-0037 Modules in C++ (Revision 6) Daveed Vandevoorde

The author argues against C-style source composition with the #include directive, and proposes an alternative called "modules".

We think that many C++ developers would agree that the C pre-processor is a legacy that should never have existed, but for the same reason (for legacy and compatibility) it should stay.

In our opinion the current proposition is just immature; at least it's not intuitive. In other words, there should be something to replace the C pre-processor (and #include as its part), but we don't like this paper from an aesthetic perspective.

 

N3365 12-0055 Filesystem Library Proposal (Revision 2)

This proposal does not say a word about the asynchronous nature of file access, while it should be designed around it.

Friday, March 16, 2012 7:21:58 PM UTC
C++ | Thinking aloud
Disclaimer
The opinions expressed herein are our own personal opinions and do not represent our employer's view in any way.

© 2024, Nesterovsky bros