[MapR培训笔记] Hadoop生态系统

SQL: Language designed for querying & transformingdata held in a relational database management system.
NoSQL: Database model that’s not necessaryly held intabular format. Often, NoSQL refers to data that is flat or nestedin format.
Log Data: Information captured about organization’sinternal system, external customer interactions & how they’reused
Streaming Data: Twitter, Facebook, Web click data, Webform data.
Flume: Reliable, scalable service used to collectstreaming data in Hadoop cluster.
Sqoop: Transfers data between external data store &Hadoop cluster. Command line tool used with both RDBMS & NoSQLdata.
NFS: Distributed file system protocal allowing computerto access files over network as through they were locally mountedstorage.
Kafka: Fast, scalable open source messaging servicedistributing log data over a cluster.
Map Reduce: Multi step process splitting large data setinto parts, & performing functions to produce finalresult.
Spark: General data analytucs on distributed computingcluster. Provides in-memory computation, enhances speed of dataprocessing over MapReduce.
Drill: Query engine for Big Data exploration. Performsdynamic queries on self-describing data in files (JSON, parquest,text, MapR-DB and HBase tables).
Mahout: Scable machine learning libraries. Corealgorithm supporting data clustering, classfication,categorization, collaborative filter (recommendation engine).
Oozie: Scalable & extensible scheduling systemcreating complex Hadoop workflows. Manages multiple job types inworkflow (MapReduce, Pig, Hive, Sqoop, distcp, script & Javaprogram).
Hive: Data warehouse instructure. SQL-like syntax forqueries using MapReduce. HiveQL provides operations similar to SQL,transformed into MapREduce program for query results.
Pig: Platform for analyzing data. Consists of data flowscripting language called ‘Pig Latin’ & Instructure forconverting scripts into sequence of MapReduce program.
HBase: Hadoop database, provided random, realtime,read/write access to very large data.
Elatic search: Scalable search server build on top ofLucene that support real-time or near real time search.
Solr: Open source search platform, part of the ApacheLucene project.
SiLK: Interface for Lucene based Solr search platform.Provides tool to perform searches & visualize results in adashboard with reports.
Web Tier: Web site delivery of big data processingresults directly to consumer (new song recommendation base on whatthe consumer hsa listened to previsouly).
Banana: Based on Kibana. Used to create dashboards forvisualizing data stored in Solr.
Kibana: Browser based dashboard for search & dataanalytics. But with HTM: & JavaScript. Easy to set up &used with just a web server.
Data Warehouse: Centralized repository for storing datafrom multiple sources (marketing, finance, human resources, websites, external data stores).