Setting Up the Test Environment
Installing the Spark Environment
To save effort, I simply used GitHub: martinprobson/vagrant-hadoop-hive-spark to stand up, via Vagrant, a virtual machine environment with Hadoop 2.7.6 + Hive 2.3.3 + Spark 2.3.0.
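For reference, bringing the VM up is roughly the following (a sketch that follows that repository's README; node1 is the guest name used in all the prompts below):

git clone https://github.com/martinprobson/vagrant-hadoop-hive-spark.git
cd vagrant-hadoop-hive-spark
vagrant up
vagrant ssh node1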
Loading the emp Table in Hive
hive> create table emp (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double, deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' ;
hive> LOAD DATA LOCAL INPATH '/usr/local/hive/examples/files/emp2.txt' OVERWRITE INTO TABLE emp;
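Per the DDL above, each line of emp2.txt is one row with '|'-separated fields in column order, shaped like the following (an illustrative line, not quoted from the file; note the empty comm field):

7369|SMITH|CLERK|7902|1980-12-17|800.00||20

A quick sanity check that the load worked (output elided):

hive> select count(*) from emp;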
Installing Gradle
Install Gradle manually, as described in the Gradle user manual.
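The manual-install route boils down to downloading the binary distribution and unpacking it under /opt/gradle (a sketch; the URL follows Gradle's standard distribution naming):

vagrant@node1:~$ wget https://services.gradle.org/distributions/gradle-4.8.1-bin.zip
vagrant@node1:~$ sudo mkdir -p /opt/gradle
vagrant@node1:~$ sudo unzip -d /opt/gradle gradle-4.8.1-bin.zip

Then put it on the PATH: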
vagrant@node1:~$ export GRADLE_HOME=/opt/gradle/gradle-4.8.1
vagrant@node1:~$ export PATH=$PATH:$GRADLE_HOME/bin
vagrant@node1:~$ gradle -v
Welcome to Gradle 4.8.1!
Here are the highlights of this release:
- Dependency locking
- Maven Publish and Ivy Publish plugins improved and marked stable
- Incremental annotation processing enhancements
- APIs to configure tasks at creation time
For more details see https://docs.gradle.org/4.8.1/release-notes.html
------------------------------------------------------------
Gradle 4.8.1
------------------------------------------------------------
Build time: 2018-06-21 07:53:06 UTC
Revision: 0abdea078047b12df42e7750ccba34d69b516a22
Groovy: 2.4.12
Ant: Apache Ant(TM) version 1.9.11 compiled on March 23 2018
JVM: 1.8.0_171 (Oracle Corporation 25.171-b11)
OS: Linux 4.4.0-128-generic amd64
The Spark Project
Directory Structure
vagrant@node1:~/HelloSparkHive$ tree
.
├── build.gradle
└── src
└── main
└── java
└── com
└── yqu
└── sparkhive
└── HelloSparkHiveDriver.java
6 directories, 2 files
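Before adding the two files, the package directory can be created in one step:

vagrant@node1:~/HelloSparkHive$ mkdir -p src/main/java/com/yqu/sparkhive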
build.gradle
apply plugin: 'java-library'
apply plugin: 'eclipse'
apply plugin: 'idea'

repositories {
    jcenter()
}

sourceCompatibility = 1.8
targetCompatibility = 1.8

dependencies {
    compileOnly 'org.apache.spark:spark-core_2.11:2.3.0'
    compileOnly 'org.apache.spark:spark-sql_2.11:2.3.0'
    testImplementation 'org.apache.spark:spark-core_2.11:2.3.0'
    testImplementation 'junit:junit:4.12'
}

jar {
    baseName = 'hello-spark-hive'
    version = '0.1.0'
}
src/main/java/com/yqu/sparkhive/HelloSparkHiveDriver.java
This example simply loads the emp table into a Dataset, creates a temporary view from that Dataset, and then creates a new table from the temporary view, which covers both reading from and writing to Hive.
package com.yqu.sparkhive;

import java.io.File;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HelloSparkHiveDriver {

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("Please provide target hive table name!");
      return;
    }

    // warehouseLocation points to the default location
    // for managed databases and tables
    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();

    SparkSession spark = SparkSession
      .builder()
      .appName("HelloSparkHive")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate();

    // Queries are expressed in HiveQL
    Dataset<Row> sqlDF = spark.sql("SELECT * FROM emp");
    System.out.println("emp content:");
    sqlDF.show();

    // Register the Dataset as a temporary view, then materialize it
    // into a new Hive table via CREATE TABLE ... AS SELECT
    sqlDF.createOrReplaceTempView("yquTempTable");
    spark.sql("create table " + args[0] + " as select * from yquTempTable");

    System.out.println(args[0] + " content:");
    spark.sql("SELECT * FROM " + args[0]).show();

    spark.close();
  }
}
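As an aside, the temp-view-plus-CTAS detour can also be expressed through Spark's DataFrameWriter. A minimal sketch of the equivalent write, reusing sqlDF and args from the driver above (not what this example actually runs):

// with: import org.apache.spark.sql.SaveMode;
// Write the Dataset directly out as a Hive table; SaveMode.ErrorIfExists
// mirrors CTAS semantics (fails if the target table already exists).
sqlDF.write().mode(SaveMode.ErrorIfExists).saveAsTable(args[0]);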
Build and Test
vagrant@node1:~/HelloSparkHive$ gradle build jar
BUILD SUCCESSFUL in 0s
2 actionable tasks: 2 executed
vagrant@node1:~/HelloSparkHive$ spark-submit --class com.yqu.sparkhive.HelloSparkHiveDriver --deploy-mode client --master local[2] /home/vagrant/HelloSparkHive/build/libs/hello-spark-hive-0.1.0.jar yqu1
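--master local[2] runs the whole job in a single local JVM with two worker threads, which is all this smoke test needs. Since the VM also runs YARN, the same jar could presumably be submitted to the cluster instead, along these lines (yqu2 is just another hypothetical target table name):

vagrant@node1:~/HelloSparkHive$ spark-submit --class com.yqu.sparkhive.HelloSparkHiveDriver --deploy-mode client --master yarn /home/vagrant/HelloSparkHive/build/libs/hello-spark-hive-0.1.0.jar yqu2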
vagrant@node1:~/HelloSparkHive$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/apache-hive-2.3.3-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/apache-tez-0.9.1-bin/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.7.6/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://
18/07/10 07:55:18 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.3.3)
Driver: Hive JDBC (version 2.3.3)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.3 by Apache Hive
hive> use default;
OK
No rows affected (0.931 seconds)
hive> show tables;
OK
emp
yqu1
2 rows selected (0.229 seconds)
hive> select * from yqu1;
18/07/02 07:55:37 [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
7369 SMITH CLERK 7902 1980-12-17 800.0
7499 ALLEN SALESMAN 7698 1981-02-20 1600.0 300
7521 WARD SALESMAN 7698 1981-02-22 1250.0 500
7566 JONES MANAGER 7839 1981-04-02 2975.0
7654 MARTIN SALESMAN 7698 1981-09-28 1250.0 1400
7698 BLAKE MANAGER 7839 1981-05-01 2850.0
7782 CLARK MANAGER 7839 1981-06-09 2450.0
7788 SCOTT ANALYST 7566 1982-12-09 3000.0
7839 KING PRESIDENT 1981-11-17 5000.0
7844 TURNER SALESMAN 7698 1981-09-08 1500.0 0
7876 ADAMS CLERK 7788 1983-01-12 1100.0
7900 JAMES CLERK 7698 1981-12-03 950.0
7902 FORD ANALYST 7566 1981-12-03 3000.0
7934 MILLER CLERK 7782 1982-01-23 1300.0
7988 KATY ANALYST 7566 NULL 1500.0
7987 JULIA ANALYST 7566 NULL 1500.0
16 rows selected (1.546 seconds)
hive>
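To rerun the driver with the same table name, drop the generated table first; the CREATE TABLE ... AS SELECT in the driver fails if the target table already exists:

hive> drop table yqu1;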