BeJUG talk: NoSQL with Hadoop and HBase, Steven Noels, 17.06.2010

Notes are a little bit cryptic, but still…


“An evolution driven by pain”
Various types of databases were standardized to RDBMSs, then further simplified with ORM frameworks.

We are now living in a world of massive data stores, with caching, denormalization, sharding, replication,… This created a need to rethink the problem, resulting in NoSQL.

Four trends:
– data size, every two years more data is created than existed before
– connectedness, more and more linking between data
– semi-structure, data that doesn’t fit rigid schemas
– architecture, from a single client per database, to multiple applications sharing one database (the DB as integration hub), to decoupled services each with their own back-end (not mentioned, but the next step will be integration of those back-ends)

Data management was a cost (hardware, DBA, infrastructure people, DB licenses,…)
Moving to considering data as an opportunity to learn about your customers, so you should capture as much as you can.

It is a Cambrian explosion (lots of evolution and new species, but only the fittest will survive):
HBase, Cassandra, CouchDB, neo4j, riak, Redis, MongoDB,…

Some solutions may no longer exist in a couple of years, and some will mature and become popular.

Common themes:
– scale, scale, scale
– new data models
– devops: more interaction between developers, DBAs, and infrastructure people
– N-O-SQL, not only SQL
– cloud: the underlying technology is no longer what matters

New data:
– Sparse structures
– weak schemas
– graphs
– semi-structured
– document oriented

– not a movement
– not ANSI NoSQL-2010: there is no standard, and none is expected soon
– not one size fits all
– not (necessarily) anti-RDBMS
– not a silver bullet

NoSQL is pro choice

Use NoSQL for…
– horizontal scale (out instead of up)
– unusually structured data (freely structured)
– speed (especially for writes)
– the bleeding edge

Use SQL/RDBMSs for…
– normalization
– a defined liability


See also the Google Bigtable and Amazon Dynamo papers, and Eric Brewer’s CAP theorem:

Dynamo: coined the term “eventual consistency”, consistent hashing
Bigtable: multi-dimensional, column-oriented database on top of the Google File System, with object versioning
CAP: you can only have two out of three of “strong consistency”, “high availability”, “partition tolerance”
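The consistent hashing mentioned for Dynamo can be sketched in a few lines. This is a toy illustration in Python — the class, hash choice, and virtual-node count are my own, not Dynamo’s actual design:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a point on the ring (illustrative hash choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding or removing a node only remaps
    the keys that fall between it and its predecessor on the ring, instead
    of reshuffling almost everything as naive modulo hashing would."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes     # virtual nodes per physical node, to smooth the distribution
        self._points = []        # sorted ring positions, for bisect
        self._ring = []          # (position, node), kept parallel to _points
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        for i in range(self.vnodes):
            p = _hash(f"{node}#{i}")
            idx = bisect.bisect(self._points, p)
            self._points.insert(idx, p)
            self._ring.insert(idx, (p, node))

    def node_for(self, key):
        """A key belongs to the first node clockwise from its hash."""
        if not self._ring:
            raise KeyError("empty ring")
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
ring.node_for("user:42")   # deterministically one of the three nodes
```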

Difference between ACID (RDBMS, pessimistic, strong consistency, less available, complex, analyzable) and BASE (availability and scaling highest priority, weak consistency, optimistic, best effort, simple and fast)

Hadoop: HDFS + MapReduce, single filesystem and single execution space
MapReduce is used for analytical and/or batch processing
Hadoop ecosystem: Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper,…
Benefits of parallelisation, more ad-hoc processing; the compartmentalized approach reduces operational risk
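The MapReduce model can be illustrated in-process with the classic word count. This is a toy sketch of the map/shuffle/reduce phases in Python, not Hadoop’s actual API:

```python
from collections import defaultdict
from itertools import chain

# Toy model of MapReduce's three phases, run in a single process.
def map_phase(records, mapper):
    """Apply the mapper to every input record; it yields (key, value) pairs."""
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    """Group all values by key (Hadoop does this across the cluster)."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's list of values."""
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count: map emits (word, 1), reduce sums the ones.
def wc_map(line):
    for word in line.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

lines = ["nosql with hbase", "hbase on hadoop"]
result = reduce_phase(shuffle(map_phase(lines, wc_map)), wc_reduce)
# result["hbase"] == 2
```

The same mapper/reducer pair, written against Hadoop’s real API, would run unchanged over terabytes — the framework supplies the distribution, the user supplies only these two functions.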



  • key-value stores

    focus on scaling huge amounts of data

    – Redis (VMware-sponsored)
    – very fast, but mostly a single server

    – Voldemort (LinkedIn)
    – persistent, distributed
    – fault-tolerant
    – Java-based
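The contract these key-value stores share — opaque values, lookup by key only, no secondary queries or joins — boils down to three operations. A toy in-memory sketch, not any particular product’s API:

```python
class KVStore:
    """Toy illustration of the key-value contract: the store never
    interprets the value, and the only access path is the key."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KVStore()
store.put("user:42", b'{"name": "Steven"}')
store.get("user:42")   # the store returns the bytes; parsing is the client's job
```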

  • column stores

    BigTable clones
    sparse tables
    data model: columns → column families → cells


    – HBase (StumbleUpon, Adobe, Cloudera)
    – sorted
    – distributed
    – highly-available
    – high performance
    – multi-dimensional (timestamp)
    – persisted
    – random access layer on HDFS
    – has a central master node

    – Cassandra (Rackspace, Facebook)
    – key-value with added structure
    – reliability (no master node)
    – eventually consistent
    – distributed
    – tunable partitioning and replication
    – PRO: linear scaling, write-optimized
    – CON: one row must fit in RAM, only primary-key-based querying
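The Bigtable-style data model above — sparse rows addressed by (row key, column family:qualifier, timestamp) — can be sketched as nested maps. A toy model in Python, not HBase’s actual API:

```python
import time
from collections import defaultdict

class SparseTable:
    """Toy Bigtable/HBase-style model: a map of
    row key -> "family:qualifier" column -> timestamp -> cell value.
    Rows are sparse: a row stores only the columns it actually has."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts=None):
        # Cells are multi-dimensional: every write is a new timestamped version.
        ts = ts if ts is not None else time.time_ns()
        self._rows[row][column][ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._rows[row][column]
        return versions[max(versions)] if versions else None

    def scan(self):
        """Row keys come back sorted, as in a Bigtable-style store."""
        return sorted(self._rows)

t = SparseTable()
t.put("row1", "cf:title", "old", ts=1)
t.put("row1", "cf:title", "new", ts=2)
t.get("row1", "cf:title")   # -> "new" (newest version wins)
```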

  • document databases

    Lotus Notes heritage
    key-value stores but DB knows what the value is
    documents often versioned
    collections of key-value collections

    – CouchDB
    – fault-tolerant
    – schema-free
    – document oriented
    – RESTful HTTP interface
    – document is a JSON object
    – view system is MapReduce-based: filter, collate, aggregate, all in JavaScript
    – out-of-the box all data needs to fit on one machine

    – MongoDB (like CouchDB)
    – C++
    – performance focus
    – native drivers
    – auto sharding (alpha)
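CouchDB’s views run a JavaScript map function over every document and collate the emitted pairs by key. An in-process Python analogue of that filter/collate step (the helper names here are my own, not CouchDB’s API):

```python
# Schema-free documents: each is just a JSON-like dict, and different
# documents may carry completely different fields.
docs = [
    {"_id": "1", "type": "talk", "speaker": "Steven Noels", "topic": "NoSQL"},
    {"_id": "2", "type": "talk", "speaker": "Steven Noels", "topic": "HBase"},
    {"_id": "3", "type": "note", "text": "cryptic"},
]

def by_speaker(doc):
    """Map function: emit (key, value) pairs, filtering as it goes.
    Documents without the expected fields are simply skipped."""
    if doc.get("type") == "talk":
        yield doc["speaker"], doc["topic"]

def build_view(docs, map_fn):
    """Collate: run the map function over every document and sort by key."""
    return sorted((k, v) for doc in docs for k, v in map_fn(doc))

rows = build_view(docs, by_speaker)
# -> [("Steven Noels", "HBase"), ("Steven Noels", "NoSQL")]
```

In CouchDB the view result is computed incrementally and persisted; here it is recomputed on every call, which is the part the database actually does for you.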


  • graph databases

    data is nodes + relationships + key/value properties

    – neo4j
    – mostly RAM-centric
    – SPARQL/SAIL implementation
    – scaling to complexity (rather than volume?)
    – “whiteboard friendly”
    – many language bindings
    – little remoting
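The graph model — nodes plus relationships, both carrying key/value properties, queried by traversal rather than joins — can be sketched in a few lines. A toy property-graph in Python, not neo4j’s API; the node and relationship names are invented for illustration:

```python
class Node:
    """Toy property-graph node: key/value properties plus typed,
    property-carrying relationships to other nodes."""

    def __init__(self, **props):
        self.props = props
        self.rels = []  # outgoing (rel_type, rel_props, target_node)

    def relate(self, rel_type, other, **props):
        self.rels.append((rel_type, props, other))

    def out(self, rel_type):
        """Traverse one hop along relationships of the given type."""
        return [target for rt, _, target in self.rels if rt == rel_type]

steven = Node(name="Steven Noels")
talk = Node(title="NoSQL with Hadoop and HBase")
bejug = Node(name="BeJUG")
steven.relate("GAVE", talk, date="17.06.2010")  # relationships have properties too
talk.relate("AT", bejug)

# "Who hosted the talks Steven gave?" is a two-hop traversal, not a join:
hosts = [n.props["name"] for t in steven.out("GAVE") for n in t.out("AT")]
# -> ["BeJUG"]
```

This is the sense in which graph databases “scale to complexity”: adding a new relationship type changes nothing about existing data, and queries follow pointers instead of joining tables.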
