BeJUG talk, NoSQL with Hadoop and HBase, Steven Noels

Notes are a little bit cryptic, but still…

NoSQL with HBase and Hadoop, Steven Noels, BeJUG 17.06.2010

Intro

“An evolution driven by pain”
Various types of databases, standardized to RDBMS, further simplified to ORM frameworks

We are now living in a world with massive data stores, with caching, denormalization, sharding, replication,… There came a need to rethink the problem, resulting in NoSQL.

Four trends:
– data size, every two years more data is created than existed before
– connectedness, more and more linking between data
– semi-structure
– architecture, from single client for data, to multiple applications on data (make the db an integration hub), to decoupled services with their own back-end (not mentioned, but the next step will be integration of the back-ends)

Data management was a cost (hardware, DBA, infrastructure people, DB licenses,…)
Moving to considering data as an opportunity to learn about your customers, so you should capture as much as you can.

It is a Cambrian explosion (lots of evolution/new species, but only the tough/best will survive):
HBase, Cassandra, CouchDB, neo4j, riak, Redis, MongoDB,…

Some solutions may no longer exist in a couple of years, and some will become better and popular.

Common themes:
– scale, scale, scale
– new data models
– devops, more interaction between developers, dba, infrastructure
– N-O-SQL, not only SQL
– cloud: technology is of no interest any more

New data:
– Sparse structures
– weak schemas
– graphs
– semi-structures
– document oriented

NoSQL
– not a movement
– not ANSI NoSQL-2010, there is no standard and none is expected soon
– not one size fits all
– not (necessarily) anti-RDBMS
– not a silver bullet

NoSQL is pro choice

Use NoSQL for…
– horizontal scale (out instead of up)
– unusually common data (free structured)
– speed (especially for writes)
– the bleeding edge

Use SQL/RDBMs for…
– SQL
– ACID
– normalization
– a defined liability

Theory

See also Google Bigtable and Amazon Dynamo papers, Eric Brewer’s CAP theorem
discuss NoSQL papers : nosqlsummer.org

Dynamo: coined the term “eventual consistency”, consistent hashing
Bigtable: multi-dimensional column oriented database, on top of GoogleFileSystem, object versioning
CAP: you can only have two out of three of “strong consistency”, “high availability”, “partition tolerance”
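
The consistent hashing noted for Dynamo can be sketched roughly as below. This is a toy ring, not Dynamo's actual implementation: the class name `HashRing`, the use of MD5 and the 100 virtual nodes per server are all assumptions made for illustration. The point is only that removing a server moves just the keys that server owned.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual nodes per server (assumed value)
        self._positions = []       # sorted ring positions
        self._owner = {}           # ring position -> server name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `replicas` virtual nodes for this server on the ring."""
        for i in range(self.replicas):
            pos = _hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        """Take all virtual nodes of this server off the ring."""
        for i in range(self.replicas):
            pos = _hash(f"{node}#{i}")
            del self._owner[pos]
            self._positions.remove(pos)

    def lookup(self, key):
        """Return the server owning the first position clockwise from the key."""
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]
```

With `ring = HashRing(["a", "b", "c"])`, calling `ring.remove("c")` only reassigns the keys that `ring.lookup` previously mapped to `"c"`; keys on `"a"` and `"b"` stay where they were.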

Difference between ACID (RDBMS, pessimistic, strong consistency, less available, complex, analyzable) and BASE (availability and scaling highest priority, weak consistency, optimistic, best effort, simple and fast)

Hadoop: HDFS + MapReduce, single filesystem and single execution space
MapReduce is used for analytical and/or batch processing
Hadoop ecosystem: Chukwa, HBase, HDFS, Hive, Mapreduce, Pig, ZooKeeper,…
Benefit of parallelisation, more ad-hoc processing, compartmentalized approach reduces operational risk
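
To make the MapReduce idea a bit less abstract, here is a single-process word count sketch. It does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example. On a real cluster the map and reduce calls would run in parallel on different machines, with the framework doing the grouping in between.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each `map_phase` call only sees one document and each `reduce_phase` call only sees one key, the work splits cleanly, which is where the parallelisation benefit comes from.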

Technology

Types:

  • key-value stores

    focus on scaling huge amounts of data

    Redis
    – vmware
    – very fast but mostly one server

    Voldemort
    – LinkedIn
    – persistent distributed
    – fault-tolerant
    – java based

  • column stores

    BigTable clones
    sparse tables
    data model: columns->column families->cells

    BigTable

    HBase
    – Stumbleupon, Adobe, Cloudera
    – sorted
    – distributed
    – highly-available
    – high performance
    – multi-dimensional (timestamp)
    – persisted
    – random access layer on HDFS
    – has a central master node

    Cassandra
    – Rackspace, Facebook
    – key-value with added structure
    – reliability (no master node)
    – eventually consistent
    – distributed
    – tunable partitioning and replication
    – PRO linear scale, write optimized
    – CON 1 row must fit in ram, only pk based querying

  • document databases

    Lotus Notes heritage
    key-value stores but DB knows what the value is
    documents often versioned
    collections of key-value collections

    CouchDB
    – fault tolerant
    – schema-free
    – document oriented
    – RESTful HTTP interface
    – document is a JSON object
    – view system is MapReduce based, Filter, Collate, Aggregate, all javascript
    – out-of-the box all data needs to fit on one machine

    MongoDB
    – like CouchDB
    – C++
    – performance focus
    – native drivers
    – auto sharding (alpha)

    Riak

  • graph databases

    data is nodes + relationships + key/value properties

    neo4j
    – mostly RAM centric
    – SPARQL/SAIL implementation
    – scaling to complexity (rather than volume?)
    – “whiteboard” friendly
    – many language bindings
    – little remoting
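
As a rough illustration of the column store notes above (sparse tables, column families, sorted rows, timestamp as an extra dimension), the data model can be imitated in memory as follows. This is not the HBase or Bigtable API; `SparseTable` and its methods are hypothetical names, and columns are written as "family:qualifier" strings purely as a convention for the sketch.

```python
import time
from collections import defaultdict


class SparseTable:
    """In-memory imitation of a Bigtable-style sparse, versioned table."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp=None):
        """Store a new version of a cell; older versions are kept."""
        ts = time.time_ns() if timestamp is None else timestamp
        self._rows[row][column][ts] = value

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (latest if omitted), else None."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row="", stop_row=None):
        """Yield (row, latest cells) in sorted row-key order within the range."""
        for row in sorted(self._rows):
            if row >= start_row and (stop_row is None or row < stop_row):
                yield row, {col: v[max(v)] for col, v in self._rows[row].items()}
```

Rows only store the columns that were actually written, which is what makes the table sparse, and a cell that was never written simply returns `None`.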

Open source essentials

I have been struck twice recently by surprising comments about open source software.

Before continuing, let me be clear. I am a fan of open source. I like it, will use it when I can and am a member of a few open source “communities” (including some I lead).

The license matters

In a commercial project I am part of, we were using the Firebird open source database for development. Now we have to cooperate with a partner who prefers to use MySQL. This was assumed to be no problem as it is open source.

Not so. Connector/J, the MySQL JDBC driver, is distributed under the GPL license. This means that anything “linked” with the driver needs to be GPL as well. Oops, so any application accessing MySQL using Connector/J has to be GPL. We’ll have to find another solution (like using the jdbc2odbc bridge).

Most notable for me was that some of our developers were not aware of the difference between open source licenses and the implications they have. That is a big problem as it can cause major problems for the companies they work for (and yes, I assume there are many closed source applications which use Connector/J and thus violate the licensing terms).

Open source is not better than commercial software

In this article, “Commercial Packages vs Open Source: Which is Better Long-term?”, the author questions whether using open source is a good business decision.

This is a typical story, of which there are a few variants. One is that commercial software is better because you get support. In my view this is not a differentiating factor between open and closed source software. The other says open source is better because you can access the source code. This again is not important in many cases. And of course there are the anarchists who claim that everything should be free.

Without considering the philosophical merits of open vs closed source software, you have to consider both as two competing business models. Each has its merits and weaknesses, but most importantly both can be executed well or badly.

  • Support :
    Only a limited set of open source projects have professional, for-pay support. If that is what you want, you had better check whether it is available. However, even though it may not be advertised, in most active projects contacting the project leads or main developers can give you assurances of support. Then again, if price is one of your main motives, there is often quite a lot of free help as well. When you ask your questions in an intelligent way, there are often people who want to show they are smart enough to help you.
    What is important to consider is that commercial companies don’t all give the same quality of support, just as not all open source communities are equally helpful. In closed source projects, the support is sometimes lacking because the company does not always consider it essential. For open source projects which offer commercial support, that support is their business. As there is no other source of revenue, they had better make sure the support is good: the source, and thus the inner knowledge, is accessible to everyone, so others will do better if there are customers who want it.
  • Community :
    This admittedly can be a weak point for open source projects. Of all the open source projects which exist, only a relatively small number are active and even fewer have a large community. In the same vein however, most closed source projects also have a limited number of customers. As always, whether the project fits your needs and your trust in the people behind the project are the driving factors behind your choice. Try getting in touch and see how helpful they are, then make up your mind.
  • Price :
    Obviously, any software which is more expensive than your budget allows is out of reach. However, you have to consider all costs. These include time spent because of (lack of) documentation, license and redistribution fees, support fees. These may also include costs of opportunity lost because of slow support. Again open vs closed source is not the differentiating factor, it is the quality which is important.
  • Access to the source :
    The importance of this depends on the kind of product. For end-user products and tools, the availability of the source code is usually irrelevant. However, for middleware or software libraries, being able to examine the source code may be interesting to learn more about the system and use it more efficiently. It can also be useful for having quicker fixes to some problems, though this also depends on the quality of the support. There are also closed source companies which (sometimes for an extra fee) make the source code available to their customers, again making the difference between the business models smaller.

Conclusion

Make sure you know the license and redistribution limitations, they may not be compatible with the project you are working on.

Investigate the products to determine which are suitable. Ask questions to check whether you are being helped. Ask questions about the support options, there may well be more options than advertised on the website. Calculate the total cost compared with competing solutions, possibly including building it yourself. Then make your choice. Whether it is open or closed source, free or for-pay should not be key deciding factors.

Choosing a web framework for rich internet enterprise application development

There is a myriad of web frameworks to choose from when developing a rich internet application. Most of them support the current buzzwords like “ria”, “web 2.0”, “ajax”.

However let us consider what is really important when choosing a web framework and how the user experience can really be “rich”, particularly in the context of an enterprise application.

Easy and fast data entry

It is actually a shame. In the evolution from terminal (and dos) based applications to (thick) gui applications to (thin) web applications we have generally evolved from applications where users could enter data very quickly using just the keyboard and the “enter” or “tab” keys to jump over unneeded fields to applications which look good, but are less practical and/or slower to work with.

Entering (bulk) data is still needed in modern applications, but it is often overlooked when focussing on more sexy aspects of the user interface. However this kind of ease in entering data should still (or again) be possible. Yes it is nice to be able to jump past a group of fields using your mouse, and yes the advanced graphical visualization techniques can be a major improvement, but we should not forget the basics. Applications should properly support quick keyboard-only data entry.

I suggest that you can choose between “tab” or “enter” to move to the next field. To make submitting the information easy, “ctrl-enter” can be used. To make it clear what is going to happen in that case, it can be useful to make a visual distinction between the default submit button and the others (like making the label bold).
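
The key handling suggested above boils down to a small decision rule, sketched here independently of any web framework. The names `KeyPress` and `handle_key` are invented for the example, and wrapping from the last field back to the first is my assumption; in a real application this logic would sit in a browser-side key event handler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPress:
    key: str            # key name, e.g. "Tab" or "Enter"
    ctrl: bool = False  # whether the ctrl modifier was held


def handle_key(press, field_index, field_count):
    """Return ("submit", None), ("focus", new_index) or ("ignore", None)."""
    if press.key == "Enter" and press.ctrl:
        # ctrl-enter submits the form from any field
        return ("submit", None)
    if press.key in ("Tab", "Enter"):
        # tab or plain enter moves on; wrap-around is an assumption
        return ("focus", (field_index + 1) % field_count)
    return ("ignore", None)
```

For example, `handle_key(KeyPress("Enter"), 0, 3)` moves focus to field 1, while `handle_key(KeyPress("Enter", ctrl=True), 0, 3)` asks for a submit.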

Consistency

For a user, surprises make an application more difficult to use or even frustrating. As for many things the 80/20 rule applies. Users will spend 80% of their time in 20% of the application. They will get used to operating the parts they constantly use. However, when they need some of the screens or features they use less frequently, it is much more user-friendly to assure things look and feel the same. For example, if in a certain form linking records is done in a particular way, and the links are displayed in a certain size and colour, then all other forms should allow linking records to be done in the same way, and the display should use the same size and colour on other pages. That way the user will immediately feel at ease and be able to complete their job a lot faster.

Cooperative development

There are several aspects which are important in the development of web applications. There are many possibilities and it is rare for one person to be capable of handling the programming aspects (javascript and whatever server-side language is used), the web layout (html and css knowledge), the design (graphical capabilities, possibly including Photoshop, GIMP or similar applications) and usability, user-friendliness and ergonomics. The rare people who master all these aspects usually don’t get the time to give them all the proper attention.

Sometimes this problem is approached by trying to remove some of these aspects out of the development loop. Some frameworks remove the need to have html/css knowledge. Often, it is “hoped” that the programmers sufficiently master the design and usability aspects. I don’t believe this is a good solution.

For me, it is clear that the different aspects should be handled by people who are trained and experienced in these areas, resulting in a split in three roles.

  • The developers need to build an application that works. This requires the programming knowledge and probably some notions of html and css. As long as you assume your developers need to focus on getting things to run this should be fine.
  • A web designer (or a team) should be responsible to make sure the user interface looks good. They need to have the feel for what “looks good” means in a user interface, and they will need to translate this to the html, css and images which are needed to build this user interface.
  • A usability or ergonomics expert should then evaluate the application and give hints and tips to the developers and web designers to make the application easier and more user-friendly. This is a separate expertise and mostly overlooked. You could argue that this role is more important for web applications with an advertising goal than for enterprise applications. While this is probably true, applying this expertise will improve your users’ experience and make them happier with your products.

Framework tips

Considering the points mentioned above, there are some choices which can be relevant for the framework choices you make.

  • Achieving consistency in the user interface is a lot easier if your web framework is component based. You can then build components for the different types of objects or even for groups of objects. This should make your life a lot easier.
    A step further towards this goal would be to generate your user interface from a domain model as much as possible. As long as this generation process is sufficiently customizable, this can be a major time saver and assure the consistency.
  • I am not in favour of frameworks which try to hide the html and css from the developer. This is good if the split between developer and designer is so clean that they don’t see each other’s work, but this seems impossible to achieve.
    However, the framework should support the cooperation between the two. This requires a good application of the model-view-controller split. The developer and designer will need to communicate a bit and help each other, but it should be clear who is in control of which files (rather than – as in jsp/asp – both needing equal access).
  • The usability expert cuts across the above two considerations. When they have been met, integrating the recommendations should be a breeze.

So what are my choices? I use the Tapestry framework. It has excellent support for components and a very clean separation between component and page templates (which contain the html markup) and code. There are some notions the designer will need to understand, but nothing too difficult.

To augment this even further, the equanda framework allows the user interface to be generated, with sufficient overriding possibilities to be usable for generating the full CRUD user interface, and provides useful components to help build extra custom pages. For example, keyboard support is standard in equanda-generated user interfaces and even assures automatic navigation to the next tab page when these are used in your user interface.