Hive SQL Partitioned Tables

 

A very nice set of common operations, well summarized!

hive> create table lpx_partition_test(global_id int, company_name string) partitioned by (stat_date string, province string) row format delimited fields terminated by ',';

OK

Time taken: 0.114 seconds

As this shows, a partition column in Hive SQL is not a column that physically exists in the data files; it is better described as one or more pseudo-columns.
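
Because these pseudo-columns behave like ordinary columns in queries, you can filter on them directly, and Hive will read only the matching partition directories instead of scanning the whole table. A minimal sketch, using partition values created later in this walkthrough:

hive> select global_id, company_name
    > from lpx_partition_test
    > where stat_date = '2011-06-08' and province = 'ZheJiang';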

hive> desc extended lpx_partition_test;
OK
global_id int
company_name string
stat_date string
province string

Detailed Table Information Table(tableName:lpx_partition_test, dbName:default, owner:root, createTime:1312186275, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:global_id, type:int, comment:null), FieldSchema(name:company_name, type:string, comment:null), FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], location:hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], parameters:{transient_lastDdlTime=1312186275}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.111 seconds

In this example, stat_date and province are created as partition columns. As in Oracle, a partition must be created before data can be inserted into it. Once a partition is created successfully, a corresponding directory is created on HDFS.

hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='ZheJiang');
OK
Time taken: 0.464 seconds

hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='GuangDong');
OK
Time taken: 7.746 seconds
hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-09', province='ZheJiang');
OK
Time taken: 0.235 seconds

root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09

root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
drwxr-xr-x - root supergroup 0 2011-08-01 16:37 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang

As the listings show, each partition maps to its own directory on HDFS, with stat_date at the parent level and province nested beneath it.
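
The same layout can be confirmed from inside Hive with SHOW PARTITIONS, which prints one partition per line in the same key=value path form as the directories above; the output here would look roughly like:

hive> show partitions lpx_partition_test;
OK
stat_date=2011-06-08/province=GuangDong
stat_date=2011-06-08/province=ZheJiang
stat_date=2011-06-09/province=ZheJiang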

Inserting data into the partitions:
hive> drop table lpx_partition_test_in;
OK
Time taken: 6.971 seconds
hive> create table lpx_partition_test_in(global_id int, company_name string, province string) row format delimited fields terminated by ' ';
OK
Time taken: 0.275 seconds
hive> LOAD DATA LOCAL INPATH '/opt/hadoop/mytest/lpx_partition_test.txt' OVERWRITE INTO TABLE lpx_partition_test_in;
Copying data from file:/opt/hadoop/mytest/lpx_partition_test.txt
Copying file: file:/opt/hadoop/mytest/lpx_partition_test.txt
Loading data to table default.lpx_partition_test_in
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test_in
OK
Time taken: 0.428 seconds
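
The original post never shows lpx_partition_test.txt itself; judging from the query results that follow, this space-delimited file would hold rows of global_id, company_name and province, along the lines of this hypothetical reconstruction:

99001 xxxcompany_name1 ZheJiang
99002 xxxcompany_name1 ZheJiang
99003 xxxcom2 ZheJiang
99001 xxxcompany_name1 GuangDong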

hive> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name from lpx_partition_test_in where province='ZheJiang';
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801172929_4b36ae2a-9d00-4432-8746-7b4d62aa8378.log
Job running in-process (local Hadoop)
2011-08-01 17:29:35,384 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1620577194, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_17-29-30_013_2844131263666576737/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
OK
Time taken: 6.524 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
Time taken: 0.559 seconds

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='GuangDong') select global_id, company_name where province='GuangDong'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='GuangDong') select global_id, company_name where province='GuangDong';
Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801180606_1dc94690-8e64-41cc-a4d7-30e927408f30.log
Job running in-process (local Hadoop)
2011-08-01 18:06:22,147 null map = 0%, reduce = 0%
2011-08-01 18:06:23,149 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = 1501179483, job is filtered out (removed at runtime).
Ended Job = -24922011, job is filtered out (removed at runtime).
Ended Job = -2114178998, job is filtered out (removed at runtime).
Ended Job = 1437573638, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10000
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10002
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10004
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10006
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-09, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 2, num_files: 2, num_rows: 6, total_size: 120]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=GuangDong)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=GuangDong)
Partition default.lpx_partition_test{stat_date=2011-06-09, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 3, num_files: 3, num_rows: 7, total_size: 143]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 8.778 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
Time taken: 0.356 seconds

-- Dynamic partitions
If a large volume of data has to be inserted into many different partitions, the static approach requires a separate insert statement for every partition, so you end up writing a great many insert statements.
For example, to load one day's data for each province you must write one insert statement per province, which is very inconvenient. To load another day's data you then have to modify both the DML and the DDL, and each insert statement is also converted into its own MapReduce job. Dynamic partitioning solves this problem: based on the input data, Hive decides dynamically which partitions to create and which partition each row goes into. The feature has been available since version 0.6.0. During a dynamic insert, the input column values are evaluated to determine the target partition on the fly; if the corresponding partition does not yet exist, it is created automatically. With this feature, a single insert statement can create and write all of the necessary partitions, and because there is only one SQL statement there is correspondingly only one MapReduce job, which greatly improves performance and reduces the load on the Hadoop cluster.

Dynamic partition parameters:
hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions each mapper or reducer may create (default 100).
hive.exec.max.dynamic.partitions: the maximum number of dynamic partitions a single DML statement may create (default 1000).
hive.exec.max.created.files: the maximum number of files all mappers and reducers together may create (default 100000).
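
If a load is expected to exceed these caps, the properties can be raised for the current session before running the insert; a sketch with arbitrary illustrative values:

hive> set hive.exec.max.dynamic.partitions.pernode=500;
hive> set hive.exec.max.dynamic.partitions=2000;
hive> set hive.exec.max.created.files=200000;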

hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=false
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province) select global_id, company_name,province;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801183737_64ce8cf1-a068-4fbf-9d8e-561118569b2c.log
Job running in-process (local Hadoop)
2011-08-01 18:37:57,566 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1141443727, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-37-51_921_8609501383674778354/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=null)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Loading partition {stat_date=2011-06-08, province=GuangDong}
Loading partition {stat_date=2011-06-08, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 6.683 seconds

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select global_id, company_name,stat_date,province;
FAILED: Error in semantic analysis: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> from lpx_partition_test_in t
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select t.global_id, t.company_name, t.stat_date, t.province DISTRIBUTE BY t.stat_date, t.province;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20110802131616_02744950-1c88-4073-8aae-07c964073c1a.log
Job running in-process (local Hadoop)
2011-08-02 13:16:30,765 null map = 0%, reduce = 0%
2011-08-02 13:16:37,776 null map = 100%, reduce = 0%
2011-08-02 13:16:40,915 null map = 100%, reduce = 100%
Ended Job = job_local_0001
Loading data to table default.lpx_partition_test partition (stat_date=null, province=null)
Loading partition {stat_date=20110608, province=GuangDong}
Loading partition {stat_date=20110608, province=ZheJiang}
Loading partition {stat_date=20110609, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=20110608, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110608, province=ZheJiang} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110609, province=ZheJiang} stats: [num_files: 1, num_rows: 2, total_size: 37]
Table default.lpx_partition_test stats: [num_partitions: 7, num_files: 7, num_rows: 12, total_size: 249]
OK
Time taken: 26.672 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
99001 xxxcompany_name1 20110608 GuangDong
99001 xxxcompany_name1 20110608 ZheJiang
99002 xxxcompany_name1 20110609 ZheJiang
99003 xxxcom2 20110609 ZheJiang
Time taken: 1.179 seconds

To keep rows that share the same partition-column values in the same MapReduce task, so that each task creates as few new partition directories as possible, you can use DISTRIBUTE BY to route rows with identical partition-column values to the same place.
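
As a reusable template, the final insert above generalizes to the pattern below (table and column names are placeholders): DISTRIBUTE BY sends all rows sharing a combination of partition-key values to the same reducer, so each partition is written by a single task.

hive> from source_table s
    > insert overwrite table target_table partition (dt, province)
    > select s.id, s.name, s.dt, s.province
    > distribute by s.dt, s.province;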

Ref: http://blog.csdn.net/lpxuan151009/article/details/6653514

