trying to solve IT problems

How I tried to fix certain programming problems, mostly in the java, JEE, JBoss scene, web area and using Ubuntu or Debian linux.

  • Home
  • Geomajas / GIS
  • About
Twitter RSS

BeJUG talk, NoSQL with Hadoop and Hbase, Steven Noels

Posted on June 18, 2010 by joachim
No CommentsLeave a comment

Notes are a little bit cryptic, but still…

NoSQL with HBase and Hadoop, Steven Noels, Bejug 17.06.2010

Intro

“An evolution drive by pain”
Various types of databases, standardized to RDBMS, further simplified to ORM frameworks

We are now living in a world with massive data stores, with caching, denormalization, sharding, replication,… There came a need to rething the problem, resulting in NoSQL.

Four trends:
- data size, every two years more data is created than existed before
- connectedness, more and more linking between data
- semi-structure,
- architecture, from single client for data, to multiple applications on data (make the db an integration hub), to decoupled services with their own back-end (not mentioned, but the next step will be integration of the back-ends)

Data management was a cost (hardware, DBA, infrastructure people, DB licenses,…)
Moving to considering data as an opportunity to learn about your customers, so you should capture as much as you can.

It is a Cambrian explosion (lot’s of evolution/new species, but only the tough/best will survive):
HBase, Cassandra, CouchDB, neo4j, riak, Redis, MongoDB,…

Some solutions may no longer exist in a couple of years, and some will become better and popular.

Common themes:
- scale, sscale, scale
- new data models
- devops, more interaction between developers, dba, infrastructure
- N-O-SQL, not only SQL
- cloud: technology is of no interest any more

New data:
- Sparse structures
- weak schemas
- graphs
- semi-structures
- document oriented

NoSQL
- not a movement
- not ANSI NoSQL-2010, there is no standard and it not expected there soon will be
- not one size fits all
- not (necessarily) anti-RDBMS
- not a silver bullet

NoSQL is pro choice

Use NoSQL for…
- horizontal scale (out instead of up)
- unusually common data (free structured)
- speed (especially for writes)
- the bleeding edge

Use SQL/RDBMs for…
- SQL
- ACID
- normalization
- a defined liability

Theory

See also Google Bigtable and Amazon Dynamo papers, Eric Brewer’s CAP theorem
discuss NoSQL papers : nosqlsummer.org

Dynamo: coined the term “eventual consistency”, consistent hashing
Bigtable: multi-dimensional column oriented database, on top of GoogleFileSystem, object versioning
CAP: you can only have two out of three of “string consistency”, “high availability”, “partition tolerance”

Difference between ACID (rdmb, pessimistic, strong consistency, less available, complex, analuzable) and BASE (availability and scaling highest priority, weak consistency, optimistic, best effort, simple and fast)

Hadoop: HDFS + MapReduce, single filesystem and single execution space
MapReduce is used for analytical and/or batch processing
Hadoop ecosystem: Chukwa, HBase, HDFS, Hive, Mapreduce, Pig, ZooKeeper,…
Benefit or parallellisation, more ad-hoc processing, compartmentalized approach reduces operational risk

Technology

Types:

  • key-value stores

    focus on scaling huge amounts of data

    Regis
    - vmware
    - very fast but mostly one server

    Voldemort
    - LinkedIn
    - persistent distributed
    - fault-tolerant
    - java based

  • column stores

    BigTable clones
    sparse tables
    data model: columns->column families->cells

    BigTable

    HBase
    - Stumbleupon, Adobe, Cloudera
    - sorted
    - distributed
    - highly-available
    - high performance
    - multi-dimensional (timestamp)
    - persisted
    - random access layer on HDFS
    - has a central master node

    Cassandra
    - Rackspace, Facebook
    - key-value with added structure
    - reliability (no master node)
    - eventual consistent
    - distributed
    - tunable partitioning and replication
    - PRO linear scale, write optimized
    - CON 1 row must fit in ram, only pk based querying

  • document databases

    Lotus Notes heritage
    key-value stores but DB knows what the value is
    documents often versioned
    collections of key-value collections

    CouchDB
    - fault tolerant
    - schema-free
    - document oriented
    - RESTful HTTP interface
    - document is a JSON object
    - view system is MapReduce based, Filter, Collate, Aggregate, all javascript
    - out-of-the box all data needs to fit on one machine

    MongoDB
    - like CouchDB
    - C++
    - performance focus
    - native drivers
    - auto sharing (alpha)

    Riak

  • graph databases

    data is nodes + relationships + key/value properties

    neo4j
    - mostly RAM centric
    - SPARQL/SAIL implementation
    - scaling to complexity (rather than volume?)
    - ‘whiteboard” friendly
    - many language bindings
    - little remoting

Be Sociable, Share!
  • Tweet
Categories: architecture, java, web development
Throwing complexity over the wall, easy in-container testing, Shrinkwrap and Arquillian, Andrew Rubinger
BeJUG talk, Google App Engine, Paul Bakker

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

*

*


question razz sad evil exclaim smile redface biggrin surprised eek confused cool lol mad twisted rolleyes wink idea arrow neutral cry mrgreen

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

  • Recent Posts

    • Plug-ins and faces overview for Geomajas
    • Adding an admin user in JIRA with embedded database
    • The Geomajas project release releases back-end 1.10, three independent projects and 15 plug-ins
    • Spatial data in the enterprise BeJUG session
    • Ktunaxa Referral Management System
  • Recent Comments

    • Eric on delete windows service account
    • nandra indonesia on delete windows service account
    • joachim on CXF ws client, dynamic endpoint and loading WSDL from the classpath
    • TXMaster on CXF ws client, dynamic endpoint and loading WSDL from the classpath
    • TXMaster on CXF ws client, dynamic endpoint and loading WSDL from the classpath
  • Categories

    • architecture
    • competencies
    • equanda
    • geomajas / GIS
    • java
    • jboss
    • maven
    • semantics
    • tapestry5
    • ubuntu / debian / linux
    • Uncategorized
    • web development
    • web services
© trying to solve IT problems. Proudly Powered by WordPress | Nest Theme by YChong