Basic Usage of Impala

程序员文章站 2022-07-11 17:55:16


Introduction to Impala
Using Impala
impala-shell external command-line options
impala-shell internal commands
Creating databases
Java development with Impala

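Introduction to Impala

Impala is a SQL query engine from Cloudera that returns query results in near real time. According to the vendor's own tests it is 3 to 10 times faster than Hive, its SQL queries are reported to run faster than Spark SQL, and it is billed as the fastest SQL query tool in the big data field.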

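Relationship between Impala and Hive
Impala is a big data analytics query engine built on top of Hive. It uses Hive's metastore directly, which means all Impala metadata is stored in the Hive metastore, and Impala is compatible with most of Hive's SQL syntax. To install Impala you therefore have to install Hive first, confirm that the installation works, and start the Hive metastore service.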

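Advantages of Impala
1. Impala is very fast, because all computation can be done in memory as long as there is enough of it.
2. It drops MapReduce in favor of a C++ implementation with targeted hardware optimizations.
3. It has data warehouse characteristics and can analyze data that already lives in Hive.
4. It supports remote access over ODBC and JDBC.
Disadvantages of Impala:
1. Computation is memory-based, so it depends heavily on available memory.
2. It is written in C++, which makes it harder to maintain.
3. It is built on Hive and tightly coupled to it: if Hive is unavailable, so is Impala.
4. It is less stable than Hive, although this does not lead to data loss.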

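Using Impala

impala-shell external command-line options

These options are passed on the command line and run without entering the interactive impala-shell prompt.
impala-shell accepts many options when it is launched: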

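-h show the help text
impala-shell -h
-r refresh all metadata after connecting;
with a large amount of data this puts a noticeable load on the server
impala-shell -r
-B turn off pretty-printing (delimited output); this can speed up queries that return a lot of data
--print_header print column names in delimited output
--output_delimiter set the field delimiter
-v show the version
impala-shell -v -V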

-f run a query file
--query_file specify the query file
cd /export/servers
vim impala-shell.sql
use weblog;
select * from ods_click_pageviews limit 10;
Run the query file with the -f option
impala-shell -f impala-shell.sql

-i connect to an impalad
--impalad specify which impalad runs the job
-o save the results to a file
--output_file specify the output file name
impala-shell -f impala-shell.sql -o hello.txt

-p show the query plan
impala-shell -f impala-shell.sql -p

-q run a query passed on the command line without entering the interactive impala-shell
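
The options above can be combined in one invocation. The following is a minimal sketch, not a definitive wrapper, of how a Java program might assemble such a command line before handing it to ProcessBuilder. The class name ImpalaShellCommand and the host node03 are hypothetical; the flags are the ones listed above, and 21000 is assumed to be the port impala-shell connects to.

```java
import java.util.ArrayList;
import java.util.List;

final class ImpalaShellCommand {
    // Builds the argument list for a non-interactive impala-shell run.
    // Pass null for impalad, outputFile, or delimiter to leave that option out.
    static List<String> build(String impalad, String queryFile,
                              String outputFile, String delimiter) {
        List<String> args = new ArrayList<>();
        args.add("impala-shell");
        if (impalad != null) {
            args.add("-i");            // connect to this impalad
            args.add(impalad);
        }
        args.add("-f");                // run the statements in this file
        args.add(queryFile);
        if (outputFile != null) {
            args.add("-o");            // save the results to this file
            args.add(outputFile);
        }
        if (delimiter != null) {
            args.add("-B");            // delimited output instead of pretty-printing
            args.add("--print_header");
            args.add("--output_delimiter=" + delimiter);
        }
        return args;
    }

    public static void main(String[] unused) {
        // "node03" is a hypothetical host name.
        List<String> cmd = build("node03:21000", "impala-shell.sql", "hello.txt", ",");
        System.out.println(String.join(" ", cmd));
        // To actually run it on a machine with impala-shell installed:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the arguments as a list rather than one string avoids shell quoting problems when a path or delimiter contains spaces.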

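impala-shell internal commands

Commands available after entering the impala-shell prompt
help command

connect command
connect hostname connects to the impalad on the given host
refresh command
refresh dbname.tablename performs an incremental refresh: it reloads the metadata of a single table and is mainly used when the data in an existing Hive table has changed
refresh mydb.stu;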

invalidate metadata command:
invalidate metadata performs a full refresh, which is expensive; it is mainly used after a new database or table has been created in Hive

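explain command:
Shows the execution plan of a SQL statement
explain select * from stu;
explain_level can be set to 0, 1, 2, or 3; level 3 is the highest and prints the most complete information
set explain_level=3;
profile command:
Run it right after a SQL statement to print the execution steps in more detail;
it is mainly used for inspecting how a query ran and for cluster tuning
select * from stu;
profile;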

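Note: data inserted from Hive, and databases or tables created in Hive, cannot be queried from Impala right away; the metadata has to be refreshed first. Data inserted from impala-shell can be queried in Impala immediately, with no refresh. This is handled by the catalog service, a component added in Impala 1.2 whose main job is to keep metadata in sync across the Impala nodes.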
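
The rule of thumb above (refresh after data changes, invalidate metadata after new objects appear) can be sketched as a small helper that picks the statement to send to Impala. The class MetadataSync and the enum HiveChange are hypothetical names for illustration. The table-scoped form of invalidate metadata used here is an assumption about the target Impala version and is cheaper than the bare, cluster-wide form shown earlier.

```java
final class MetadataSync {
    // What happened on the Hive side since Impala last loaded its metadata.
    enum HiveChange { DATA_CHANGED, TABLE_CREATED }

    // Returns the Impala statement that makes the change visible.
    static String statementFor(HiveChange change, String db, String table) {
        String name = db + "." + table;
        switch (change) {
            case DATA_CHANGED:
                // Existing table, new or modified data files: incremental refresh.
                return "refresh " + name;
            case TABLE_CREATED:
                // Table unknown to Impala so far: metadata must be reloaded.
                return "invalidate metadata " + name;
            default:
                throw new IllegalArgumentException("unknown change: " + change);
        }
    }

    public static void main(String[] args) {
        System.out.println(statementFor(HiveChange.DATA_CHANGED, "mydb", "stu"));
        System.out.println(statementFor(HiveChange.TABLE_CREATED, "mydb", "stu"));
    }
}
```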

Creating databases

List all databases
show databases;
Creating and dropping databases
Create a database
CREATE DATABASE IF NOT EXISTS mydb1;
Drop a database
drop database if exists mydb;

Create a table and specify where its data is stored on HDFS (the syntax is similar to Hive's)
hdfs dfs -mkdir -p /input/impala
create external table t3(id int, name string, age int) row format delimited fields terminated by '\t' location '/input/impala/external';

Creating tables
Create the student table
CREATE TABLE IF NOT EXISTS mydb1.student (name STRING, age INT, contact INT);
Create the employee table
create table employee (id INT, name STRING, age INT, address STRING, salary BIGINT);

Inserting data into a table
insert into employee (id, name, age, address, salary) values (1, 'Ramesh', 32, 'Ahmedabad', 20000);
insert into employee values (2, 'Khilan', 25, 'Delhi', 15000);

Querying data
select * from employee;
select name, age from employee;
Drop a table
DROP table mydb1.employee;
Truncate a table (remove all of its rows)
truncate employee;
Create a view
CREATE VIEW IF NOT EXISTS employee_view AS select name, age from employee;
Query the view
select * from employee_view;
order by clause
Basic syntax
select * from table_name ORDER BY col_name [ASC|DESC] [NULLS FIRST|NULLS LAST]
select * from employee ORDER BY id asc;

group by clause
select name, sum(salary) from employee group by name;

having clause
Basic syntax
select column_list from table_name GROUP BY col_name HAVING condition
Group the table by age, take the maximum salary of each group, and show only the maximums greater than 20000
select max(salary) from employee group by age having max(salary) > 20000;
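
To make the difference between where and having concrete, here is a small in-memory sketch of what the query above computes: rows are first grouped by age, the maximum salary is taken per group, and only then is the filter applied to whole groups. It does not talk to Impala. The class HavingDemo is a hypothetical name, and the rows for Kaushik and Chaitali are invented for illustration; only Ramesh and Khilan come from the inserts shown earlier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HavingDemo {
    static final class Employee {
        final String name;
        final int age;
        final long salary;

        Employee(String name, int age, long salary) {
            this.name = name;
            this.age = age;
            this.salary = salary;
        }
    }

    // In-memory equivalent of:
    //   select max(salary) from employee group by age having max(salary) > threshold
    static Map<Integer, Long> maxSalaryByAge(List<Employee> rows, long threshold) {
        Map<Integer, Long> maxByAge = new TreeMap<>();
        for (Employee e : rows) {
            // GROUP BY age, keeping max(salary) for each group
            maxByAge.merge(e.age, e.salary, (a, b) -> Math.max(a, b));
        }
        // HAVING filters groups after aggregation, not individual rows
        maxByAge.values().removeIf(max -> max <= threshold);
        return maxByAge;
    }

    public static void main(String[] args) {
        List<Employee> rows = Arrays.asList(
            new Employee("Ramesh", 32, 20000),
            new Employee("Khilan", 25, 15000),
            new Employee("Kaushik", 32, 25000),   // hypothetical row
            new Employee("Chaitali", 25, 18000)   // hypothetical row
        );
        System.out.println(maxSalaryByAge(rows, 20000)); // prints {32=25000}
    }
}
```

With only the two rows from the original inserts, no group would pass the filter, because the comparison is strictly greater than 20000.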

limit clause
select * from employee order by id limit 4;

Ways to load data into an Impala table
Method 1: load data from HDFS into Impala with load data
create table user(id int, name string, age int) row format delimited fields terminated by '\t';
Prepare the file user.txt (columns separated by tabs) and upload it to /user/impala on HDFS
1 hello 15
2 zhangsan 20
3 lisi 30
4 wangwu 50
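
Because the table is declared with fields terminated by a tab, the columns in user.txt have to be separated by real tab characters; spaces would leave every row as a single unparsed field. The sketch below generates the file with the four rows shown above. The class name UserFileWriter is hypothetical, and the upload command in the comment assumes the standard hdfs dfs -put tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class UserFileWriter {
    // Joins each row's fields with a tab, matching "fields terminated by '\t'".
    static List<String> toTabDelimited(String[][] rows) {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join("\t", row));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String[][] rows = {
            {"1", "hello", "15"},
            {"2", "zhangsan", "20"},
            {"3", "lisi", "30"},
            {"4", "wangwu", "50"},
        };
        Path out = Paths.get("user.txt");
        Files.write(out, toTabDelimited(rows), StandardCharsets.UTF_8);
        System.out.println("wrote " + out.toAbsolutePath());
        // Then upload it, for example: hdfs dfs -put user.txt /user/impala/
    }
}
```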

Load the data
load data inpath '/user/impala/' into table user;

Query the loaded data
select * from user;
If the query returns no data, refresh the table once
refresh user;
Method 2: create table as select
create table user2 as select * from user;
Method 3:
insert into
Method 4:
insert into select
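
The last two methods are only named above, so here is a hedged sketch that lists one statement per method and shows how they could be sent over an existing JDBC connection. The class LoadMethods, the row (5, 'zhaoliu', 40), and the filter age > 20 are all invented for illustration; the first two statements are the ones from the text.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

final class LoadMethods {
    // One statement per import method, in the order listed above.
    static List<String> statements() {
        return Arrays.asList(
            "load data inpath '/user/impala/' into table user",      // method 1
            "create table user2 as select * from user",              // method 2
            "insert into user2 values (5, 'zhaoliu', 40)",           // method 3, hypothetical row
            "insert into user2 select * from user where age > 20"    // method 4, hypothetical filter
        );
    }

    // Runs the statements in order on an already opened connection to Impala.
    static void runAll(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String sql : statements()) {
                statement.execute(sql);
            }
        }
    }

    public static void main(String[] args) {
        // Only prints the statements; runAll needs a live connection.
        statements().forEach(System.out::println);
    }
}
```

Single-row insert into ... values creates one small file per statement, so for bulk data the load data and insert into ... select forms are usually the better fit.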

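Java development with Impala

Because Impala queries are fast, real projects sometimes use Impala as the query engine behind an application. The queries can be issued from Java code.
Step 1: add the jar dependencies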

  <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>central</id>
            <url>https://repo1.maven.org/maven2/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>



    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-common</artifactId>
            <version>1.1.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-metastore</artifactId>
            <version>1.1.0-cdh5.14.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-service</artifactId>
            <version>1.1.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.1.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.1.0-cdh5.14.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.thrift/libfb303 -->
        <dependency>
            <groupId>org.apache.thrift</groupId>
            <artifactId>libfb303</artifactId>
            <version>0.9.0</version>
            <type>pom</type>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.thrift/libthrift -->
        <dependency>
            <groupId>org.apache.thrift</groupId>
            <artifactId>libthrift</artifactId>
            <version>0.9.0</version>
            <type>pom</type>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.2.5</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.2.5</version>
        </dependency>

    </dependencies>

Step 2: Java code for querying Impala

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpalaJdbc {
    public static void main(String[] args) throws Exception {
        // Driver class, connection URL, and the SQL statement to run
        String driver = "org.apache.hive.jdbc.HiveDriver";
        String driverUrl = "jdbc:hive2://192.168.52.120:21050/mydb1;auth=noSasl";
        String sql = "select * from student";

        // Load the JDBC driver by reflection
        Class.forName(driver);

        // try-with-resources closes the result set, statement, and connection
        // even when the query throws
        try (Connection connection = DriverManager.getConnection(driverUrl);
             PreparedStatement preparedStatement = connection.prepareStatement(sql);
             ResultSet resultSet = preparedStatement.executeQuery()) {
            // Number of columns in the result
            int col = resultSet.getMetaData().getColumnCount();
            // Iterate over the result set and print one tab-separated line per row
            while (resultSet.next()) {
                for (int i = 1; i <= col; i++) {
                    System.out.print(resultSet.getString(i) + "\t");
                }
                System.out.print("\n");
            }
        }
    }
}
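
The connection URL in the code above packs several settings into one string. As a rough sketch, it can be assembled from its parts like this. The class ImpalaUrl is a hypothetical helper. The jdbc:hive2 scheme is used because the code goes through the Hive JDBC driver, 21050 is assumed to be the port on which impalad accepts HiveServer2-protocol clients, and auth=noSasl assumes a cluster without Kerberos or LDAP authentication.

```java
final class ImpalaUrl {
    // Builds a Hive-JDBC style URL for an unsecured Impala daemon.
    static String of(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database + ";auth=noSasl";
    }

    public static void main(String[] args) {
        // Reproduces the URL used in the example above.
        System.out.println(of("192.168.52.120", 21050, "mydb1"));
    }
}
```

Any impalad in the cluster can serve as the host, so the address is a natural candidate for a configuration property rather than a hard-coded string.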
