BeJUG talk, NoSQL with Hadoop and HBase, Steven Noels

Notes are a little bit cryptic, but still…

NoSQL with HBase and Hadoop, Steven Noels, BeJUG 17.06.2010

Intro

“An evolution driven by pain”
Various types of databases, standardized to RDBMS, further simplified to ORM frameworks

We are now living in a world with massive data stores, with caching, denormalization, sharding, replication,… There came a need to rethink the problem, resulting in NoSQL.

Four trends:
– data size, every two years more data is created than existed before
– connectedness, more and more linking between data
– semi-structured data
– architecture, from single client for data, to multiple applications on data (make the db an integration hub), to decoupled services with their own back-end (not mentioned, but the next step will be integration of the back-ends)

Data management was a cost (hardware, DBA, infrastructure people, DB licenses,…)
Moving to considering data as an opportunity to learn about your customers, so you should capture as much as you can.

It is a Cambrian explosion (lots of evolution/new species, but only the toughest/best will survive):
HBase, Cassandra, CouchDB, neo4j, riak, Redis, MongoDB,…

Some solutions may no longer exist in a couple of years; others will mature and become popular.

Common themes:
– scale, scale, scale
– new data models
– devops: more interaction between developers, DBAs and infrastructure people
– N-O-SQL, not only SQL
– cloud: technology is of no interest any more

New data:
– Sparse structures
– weak schemas
– graphs
– semi-structured
– document-oriented

NoSQL
– not a movement
– not ANSI NoSQL-2010: there is no standard, and none is expected any time soon
– not one size fits all
– not (necessarily) anti-RDBMS
– not a silver bullet

NoSQL is pro choice

Use NoSQL for…
– horizontal scale (out instead of up)
– unusual, freely structured data
– speed (especially for writes)
– the bleeding edge

Use SQL/RDBMS for…
– SQL
– ACID
– normalization
– a defined liability

Theory

See also Google Bigtable and Amazon Dynamo papers, Eric Brewer’s CAP theorem
discuss NoSQL papers: nosqlsummer.org

Dynamo: coined the term “eventual consistency”, consistent hashing
Bigtable: multi-dimensional, column-oriented database on top of the Google File System, object versioning
CAP: you can only have two out of three of “strong consistency”, “high availability”, “partition tolerance”
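
As an aside, the consistent hashing technique from the Dynamo paper is simple to sketch: nodes and keys are hashed onto the same ring, and a key belongs to the first node clockwise from its position, so adding or removing a node only moves a fraction of the keys. A minimal, illustrative Java sketch (the MD5 hash and the node names are just assumptions; real systems also add virtual nodes per physical node for smoother balancing):

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal consistent-hash ring: nodes and keys share one hash space,
    // and a key is owned by the first node at or after its hash position.
    public class ConsistentHashRing {
        private final SortedMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

        public void addNode(String node) throws Exception {
            ring.put(hash(node), node);
        }

        public String nodeFor(String key) throws Exception {
            SortedMap<BigInteger, String> tail = ring.tailMap(hash(key));
            // wrap around the ring if the key hashes past the last node
            return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
        }

        private static BigInteger hash(String s) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
            return new BigInteger(1, digest);
        }

        public static void main(String[] args) throws Exception {
            ConsistentHashRing ring = new ConsistentHashRing();
            ring.addNode("node-a");
            ring.addNode("node-b");
            ring.addNode("node-c");
            System.out.println("user:42 is stored on " + ring.nodeFor("user:42"));
        }
    }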

Difference between ACID (RDBMS, pessimistic, strong consistency, less available, complex, analyzable) and BASE (availability and scaling highest priority, weak consistency, optimistic, best effort, simple and fast)

Hadoop: HDFS + MapReduce, single filesystem and single execution space
MapReduce is used for analytical and/or batch processing
Hadoop ecosystem: Chukwa, HBase, HDFS, Hive, MapReduce, Pig, ZooKeeper,…
Benefits: parallelization, more ad-hoc processing; the compartmentalized approach reduces operational risk
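
To give an idea of the programming model, a condensed sketch of the classic word-count job against the org.apache.hadoop.mapreduce API (input and output paths are taken from the command line):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Word count: the map phase emits (word, 1) pairs, the reduce phase sums them per word.
    public class WordCount {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }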

Technology

Types:

  • key-value stores

    focus on scaling huge amounts of data

    Redis
    – VMware
    – very fast, but mostly single-server
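
    Since a key-value store basically exposes get/put on opaque values, the client side stays tiny. A hypothetical snippet using the Jedis Java client against a local Redis server (host, port and key names are assumptions; any Redis client would do):

        import redis.clients.jedis.Jedis;

        // Minimal key-value usage against a local Redis server.
        public class RedisSketch {
            public static void main(String[] args) {
                Jedis jedis = new Jedis("localhost", 6379);

                // plain string key/value
                jedis.set("user:42:name", "Steven");
                System.out.println(jedis.get("user:42:name"));

                // Redis also offers richer structures such as counters and lists
                jedis.incr("pageviews");
                jedis.lpush("recent-visitors", "user:42");

                jedis.disconnect();
            }
        }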

    Voldemort
    – LinkedIn
    – persistent distributed
    – fault-tolerant
    – java based

  • column stores

    BigTable clones
    sparse tables
    data model: rows -> column families -> columns -> cells

    BigTable

    HBase
    – StumbleUpon, Adobe, Cloudera
    – sorted
    – distributed
    – highly-available
    – high performance
    – multi-dimensional (timestamp)
    – persisted
    – random access layer on HDFS
    – has a central master node
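
    To make the column-family model concrete, a minimal write and random-access read through the (older, HTable-based) HBase Java client; table, family, qualifier and row names are invented for the sketch:

        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        // Cells are addressed by (row, column family, column qualifier, timestamp).
        public class HBaseSketch {
            public static void main(String[] args) throws Exception {
                HTable table = new HTable(HBaseConfiguration.create(), "webtable");

                // write one cell into the "content" column family
                Put put = new Put(Bytes.toBytes("com.example.www"));
                put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                        Bytes.toBytes("<html>hello</html>"));
                table.put(put);

                // random-access read of the same row, served on top of HDFS
                Result result = table.get(new Get(Bytes.toBytes("com.example.www")));
                byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
                System.out.println(Bytes.toString(html));

                table.close();
            }
        }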

    Cassandra
    – Rackspace, Facebook
    – key-value with added structure
    – reliability (no master node)
    – eventually consistent
    – distributed
    – tunable partitioning and replication
    – PRO: linear scaling, write-optimized
    – CON: a row must fit in RAM, only primary-key-based querying

  • document databases

    Lotus Notes heritage
    key-value stores, but the DB knows what the value is
    documents often versioned
    collections of key-value collections

    CouchDB
    – fault tolerant
    – schema-free
    – document oriented
    – RESTful HTTP interface
    – document is a JSON object
    – view system is MapReduce-based (filter, collate, aggregate), all in JavaScript
    – out of the box, all data needs to fit on one machine
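
    Because the interface is plain HTTP plus JSON, no dedicated driver is strictly needed. A bare-bones document insert from Java could look like this (database name, document id and a local CouchDB on the default port 5984 are assumptions; the “talks” database itself would first be created with a PUT on /talks):

        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;

        // Store a JSON document in CouchDB with a plain HTTP PUT.
        public class CouchDbSketch {
            public static void main(String[] args) throws Exception {
                URL url = new URL("http://localhost:5984/talks/bejug-2010-06-17");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("PUT");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type", "application/json");

                String json = "{\"speaker\":\"Steven Noels\",\"topic\":\"NoSQL\"}";
                OutputStream out = conn.getOutputStream();
                out.write(json.getBytes("UTF-8"));
                out.close();

                // 201 Created on success; the response body carries the new revision
                System.out.println("HTTP " + conn.getResponseCode());
            }
        }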

    MongoDB
    – like CouchDB
    – C++
    – performance focus
    – native drivers
    – auto-sharding (alpha)
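
    With MongoDB the same document idea is reached through a native driver instead of HTTP. A sketch using the (old-style) MongoDB Java driver, where the database and collection names are invented and a local mongod is assumed:

        import com.mongodb.BasicDBObject;
        import com.mongodb.DB;
        import com.mongodb.DBCollection;
        import com.mongodb.DBObject;
        import com.mongodb.Mongo;

        // Insert and query a schema-free document through the native Java driver.
        public class MongoSketch {
            public static void main(String[] args) throws Exception {
                Mongo mongo = new Mongo("localhost", 27017);
                DB db = mongo.getDB("bejug");
                DBCollection talks = db.getCollection("talks");

                // documents are just nested key/value structures
                BasicDBObject doc = new BasicDBObject();
                doc.put("speaker", "Steven Noels");
                doc.put("topic", "NoSQL");
                talks.insert(doc);

                // query by example
                DBObject found = talks.findOne(new BasicDBObject("topic", "NoSQL"));
                System.out.println(found);

                mongo.close();
            }
        }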

    Riak

  • graph databases

    data is nodes + relationships + key/value properties

    neo4j
    – mostly RAM centric
    – SPARQL/SAIL implementation
    – scaling to complexity (rather than volume?)
    – “whiteboard” friendly
    – many language bindings
    – little remoting
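
    A tiny embedded sketch of the nodes + relationships + key/value properties model, using the old embedded Neo4j API (the store path, property names and the KNOWS relationship type are made up for the example):

        import org.neo4j.graphdb.DynamicRelationshipType;
        import org.neo4j.graphdb.GraphDatabaseService;
        import org.neo4j.graphdb.Node;
        import org.neo4j.graphdb.Relationship;
        import org.neo4j.graphdb.Transaction;
        import org.neo4j.kernel.EmbeddedGraphDatabase;

        // Nodes, relationships and key/value properties, written inside one transaction.
        public class Neo4jSketch {
            public static void main(String[] args) {
                GraphDatabaseService graphDb = new EmbeddedGraphDatabase("target/graphdb");
                Transaction tx = graphDb.beginTx();
                try {
                    Node alice = graphDb.createNode();
                    alice.setProperty("name", "Alice");

                    Node bob = graphDb.createNode();
                    bob.setProperty("name", "Bob");

                    Relationship knows = alice.createRelationshipTo(bob,
                            DynamicRelationshipType.withName("KNOWS"));
                    knows.setProperty("since", 2010);

                    tx.success();
                } finally {
                    tx.finish();
                }
                graphDb.shutdown();
            }
        }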
