Stanford Seminar - Big Data is (at least) Four Different Problems 2016

The Meaning of Big Data: big volume, big velocity, big variety.

Big Volume

  • Well addressed by the data warehouse crowd: multi-node column stores with sophisticated compression (see the run-length-encoding sketch after this list). But performance differs a lot between products.
  • NVRAM (non-volatile memory).
  • The network is no longer the “long pole in the tent” (i.e., no longer the bottleneck).
  • Which helps more: more data scientists, more data, or a better model?
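
Column stores compress well because each column's values are stored contiguously and tend to repeat. As a minimal illustration (a hypothetical toy, not any product's actual codec), here is run-length encoding of a column in Python:

import itertools

def rle_encode(column):
    # Collapse each run of equal adjacent values into a (value, run_length) pair.
    return [(value, sum(1 for _ in run)) for value, run in itertools.groupby(column)]

country = ["US", "US", "US", "DE", "DE", "FR"]   # a sorted column compresses best
print(rle_encode(country))                       # [('US', 3), ('DE', 2), ('FR', 1)]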

Data science means complex analytics, and complex analytics is always array-based.

Until (tired) {
  Data management;
  Complex analytics (regression, clustering, Bayesian analysis, ...);
}
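
Concretely, the loop above might look like the following Python sketch (hypothetical data and stopping rule; numpy stands in for both the data-management step and the array analytics):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # hypothetical feature array
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

threshold, tired = 1.0, False
while not tired:
    # Data management: select the working set (in a real pipeline, a DBMS query).
    mask = np.abs(y) < threshold
    Xs, ys = X[mask], y[mask]
    # Complex analytics: an array computation, here least-squares regression.
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    # Inspect the model, widen the working set, repeat until "tired".
    threshold *= 2.0
    tired = threshold > 8.0

print(coef)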

Options:

  1. Map-Reduce (Hadoop; inflexible) over data in HDFS (the Hadoop Distributed File System; too slow).
    • The map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
    • The reduce job takes the output of a map as input and combines those tuples into a smaller set of tuples. (A toy word-count sketch follows this list.)
    • 2011~2015: abandoned by Google.
    • 2013: 95% of Facebook's accesses go through SQL (Hive).
    • 2013: Cloudera's 3-level stack: SQL on Map-Reduce on HDFS.
    • 2014: Impala's 2-level stack: SQL directly on HDFS; trending toward a 1-level stack of SQL alone.
    • The data warehouse market and the Hadoop market are merging; HDFS is being marketed to support “data lakes”.
  2. Spark: no persistence (it is a compute framework, not a file system), no sharing, no indices.
    • Future: DBMSs as a persistence layer under Spark.
  3. Move the query to the data; but this tightly couples the data manager to the analytics.
  4. Array DBMSs (database management systems): SciDB, SciDB-R.
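
The word-count sketch referenced in option 1, as a single-process Python toy (the canonical MapReduce example; no Hadoop required):

from collections import defaultdict

def map_job(document):
    # Map: break each element down into (key, value) tuples, here (word, 1).
    for word in document.split():
        yield (word.lower(), 1)

def reduce_job(key, values):
    # Reduce: combine all values for one key into a smaller tuple.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

grouped = defaultdict(list)            # the shuffle phase: group map output by key
for doc in documents:
    for key, value in map_job(doc):
        grouped[key].append(value)

counts = dict(reduce_job(k, vs) for k, vs in grouped.items())
print(counts)                          # {'the': 3, 'quick': 1, 'brown': 1, ...}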

Big Velocity

Options:

  1. Big pattern, little state (e.g., electronic trading): complex event processing (CEP) with engines such as Storm, Kafka, and StreamBase, looking for patterns in a firehose (see the sliding-window sketch after this list).
  2. Big state, little pattern: essentially OLTP, served by NewSQL engines (VoltDB, NuoDB, MemSQL, …).
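
The sliding-window sketch referenced in option 1: a toy Python CEP rule (a hypothetical "n consecutive upticks" pattern) that keeps only a tiny window of state while scanning a firehose of events:

from collections import deque

def upticks(prices, n=3):
    # Little state: a window of the last n + 1 prices is all the engine keeps.
    window = deque(maxlen=n + 1)
    for price in prices:
        window.append(price)
        w = list(window)
        # Big pattern: fire whenever the last n ticks were all increases.
        if len(w) == n + 1 and all(a < b for a, b in zip(w, w[1:])):
            yield price

ticks = [10.0, 10.1, 10.2, 10.3, 10.2, 10.3, 10.4, 10.5]
for p in upticks(ticks):
    print("three consecutive upticks ending at", p)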

Possible storm clouds (technologies that could upend these engines): RDMA, Spanner…

Big Variety

No global data model. Most everyday data never makes it into the data warehouse, and there is also public data from the web, data from other companies, and more. Classic extract, transform, and load (ETL), with a programmer writing each transformation, cannot handle a large number of data sources.

Data integration is hard.

RocksDB Is Eating the Database World 2020

Log-Structured Merge Trees (LSM-trees), after the 1996 paper by O'Neil et al.
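
The LSM idea in a toy Python sketch (illustration only, nothing like RocksDB's real code): writes go to an in-memory memtable; a full memtable is flushed as an immutable sorted run; reads check the memtable, then the runs from newest to oldest; compaction merges runs.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}     # in-memory write buffer
        self.runs = []         # immutable sorted runs, newest first (SSTables on disk in a real engine)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.runs.insert(0, sorted(self.memtable.items()))   # flush as a sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                 # freshest data first
            return self.memtable[key]
        for run in self.runs:                    # then newest run wins
            for k, v in run:                     # a real engine would binary-search
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in reversed(self.runs):          # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())]     # one merged run; tombstones omitted here

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.compact()
print(db.get("k3"))   # 3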

Redis