Hive SQL Partitioned Tables

 

A very nice set of common operations, well summarized!

hive> create table lpx_partition_test(global_id int, company_name string) partitioned by (stat_date string, province string) row format delimited fields terminated by ',';

OK

Time taken: 0.114 seconds

As this shows, a partition column in Hive SQL is not a column that physically exists in the data files; it is better described as one or more pseudo-columns.
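
Because these pseudo-columns behave like ordinary columns in queries, you can filter on them directly, and Hive will read only the matching partition directories instead of scanning the whole table. A minimal sketch, using partition values created later in this walkthrough:

hive> select global_id, company_name
    > from lpx_partition_test
    > where stat_date = '2011-06-08' and province = 'ZheJiang';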

hive> desc extended lpx_partition_test;
OK
global_id int
company_name string
stat_date string
province string

Detailed Table Information Table(tableName:lpx_partition_test, dbName:default, owner:root, createTime:1312186275, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:global_id, type:int, comment:null), FieldSchema(name:company_name, type:string, comment:null), FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], location:hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], parameters:{transient_lastDdlTime=1312186275}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.111 seconds

In this example, stat_date and province are created as partition columns. As in Oracle, a partition must be created before data can be inserted into it. Once a partition is created successfully, a corresponding directory is created on HDFS.

hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='ZheJiang');
OK
Time taken: 0.464 seconds

hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='GuangDong');
OK
Time taken: 7.746 seconds
hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-09', province='ZheJiang');
OK
Time taken: 0.235 seconds

root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09

root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
drwxr-xr-x - root supergroup 0 2011-08-01 16:37 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang

As the listings show, each partition maps to its own directory on HDFS, with stat_date at the parent level and province nested beneath it.
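
The same layout can be confirmed from inside Hive with SHOW PARTITIONS, which prints one partition per line in the same key=value path form as the directories above; the output here would look roughly like:

hive> show partitions lpx_partition_test;
OK
stat_date=2011-06-08/province=GuangDong
stat_date=2011-06-08/province=ZheJiang
stat_date=2011-06-09/province=ZheJiang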

Inserting data into the partitions:
hive> drop table lpx_partition_test_in;
OK
Time taken: 6.971 seconds
hive> create table lpx_partition_test_in(global_id int, company_name string, province string) row format delimited fields terminated by ' ';
OK
Time taken: 0.275 seconds
hive> LOAD DATA LOCAL INPATH '/opt/hadoop/mytest/lpx_partition_test.txt' OVERWRITE INTO TABLE lpx_partition_test_in;
Copying data from file:/opt/hadoop/mytest/lpx_partition_test.txt
Copying file: file:/opt/hadoop/mytest/lpx_partition_test.txt
Loading data to table default.lpx_partition_test_in
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test_in
OK
Time taken: 0.428 seconds
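
The original post never shows lpx_partition_test.txt itself; judging from the query results that follow, this space-delimited file would hold rows of global_id, company_name and province, along the lines of this hypothetical reconstruction:

99001 xxxcompany_name1 ZheJiang
99002 xxxcompany_name1 ZheJiang
99003 xxxcom2 ZheJiang
99001 xxxcompany_name1 GuangDong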

hive> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name from lpx_partition_test_in where province='ZheJiang';
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801172929_4b36ae2a-9d00-4432-8746-7b4d62aa8378.log
Job running in-process (local Hadoop)
2011-08-01 17:29:35,384 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1620577194, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_17-29-30_013_2844131263666576737/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
OK
Time taken: 6.524 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
Time taken: 0.559 seconds

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='GuangDong') select global_id, company_name where province='GuangDong'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='GuangDong') select global_id, company_name where province='GuangDong';
Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801180606_1dc94690-8e64-41cc-a4d7-30e927408f30.log
Job running in-process (local Hadoop)
2011-08-01 18:06:22,147 null map = 0%, reduce = 0%
2011-08-01 18:06:23,149 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = 1501179483, job is filtered out (removed at runtime).
Ended Job = -24922011, job is filtered out (removed at runtime).
Ended Job = -2114178998, job is filtered out (removed at runtime).
Ended Job = 1437573638, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10000
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10002
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10004
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10006
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-09, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 2, num_files: 2, num_rows: 6, total_size: 120]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=GuangDong)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=GuangDong)
Partition default.lpx_partition_test{stat_date=2011-06-09, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 3, num_files: 3, num_rows: 7, total_size: 143]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 8.778 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
Time taken: 0.356 seconds

-- Dynamic partitions
If a large volume of data has to be inserted into many different partitions, the static approach requires a separate insert statement for every partition, so you end up writing a great many insert statements.
For example, to load one day's data for each province you must write one insert statement per province, which is very inconvenient. To load another day's data you then have to modify both the DML and the DDL, and each insert statement is also converted into its own MapReduce job. Dynamic partitioning solves this problem: based on the input data, Hive decides dynamically which partitions to create and which partition each row goes into. The feature has been available since version 0.6.0. During a dynamic insert, the input column values are evaluated to determine the target partition on the fly; if the corresponding partition does not yet exist, it is created automatically. With this feature, a single insert statement can create and write all of the necessary partitions, and because there is only one SQL statement there is correspondingly only one MapReduce job, which greatly improves performance and reduces the load on the Hadoop cluster.

Dynamic partition parameters:
hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions each mapper or reducer may create (default 100).
hive.exec.max.dynamic.partitions: the maximum number of dynamic partitions a single DML statement may create (default 1000).
hive.exec.max.created.files: the maximum number of files all mappers and reducers together may create (default 100000).
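
If a load is expected to exceed these caps, the properties can be raised for the current session before running the insert; a sketch with arbitrary illustrative values:

hive> set hive.exec.max.dynamic.partitions.pernode=500;
hive> set hive.exec.max.dynamic.partitions=2000;
hive> set hive.exec.max.created.files=200000;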

hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=false
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province) select global_id, company_name,province;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801183737_64ce8cf1-a068-4fbf-9d8e-561118569b2c.log
Job running in-process (local Hadoop)
2011-08-01 18:37:57,566 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1141443727, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-37-51_921_8609501383674778354/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=null)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Loading partition {stat_date=2011-06-08, province=GuangDong}
Loading partition {stat_date=2011-06-08, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 6.683 seconds

hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select global_id, company_name,stat_date,province;
FAILED: Error in semantic analysis: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> from lpx_partition_test_in t
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select t.global_id, t.company_name, t.stat_date, t.province DISTRIBUTE BY t.stat_date, t.province;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20110802131616_02744950-1c88-4073-8aae-07c964073c1a.log
Job running in-process (local Hadoop)
2011-08-02 13:16:30,765 null map = 0%, reduce = 0%
2011-08-02 13:16:37,776 null map = 100%, reduce = 0%
2011-08-02 13:16:40,915 null map = 100%, reduce = 100%
Ended Job = job_local_0001
Loading data to table default.lpx_partition_test partition (stat_date=null, province=null)
Loading partition {stat_date=20110608, province=GuangDong}
Loading partition {stat_date=20110608, province=ZheJiang}
Loading partition {stat_date=20110609, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=20110608, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110608, province=ZheJiang} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110609, province=ZheJiang} stats: [num_files: 1, num_rows: 2, total_size: 37]
Table default.lpx_partition_test stats: [num_partitions: 7, num_files: 7, num_rows: 12, total_size: 249]
OK
Time taken: 26.672 seconds

hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
99001 xxxcompany_name1 20110608 GuangDong
99001 xxxcompany_name1 20110608 ZheJiang
99002 xxxcompany_name1 20110609 ZheJiang
99003 xxxcom2 20110609 ZheJiang
Time taken: 1.179 seconds

To keep rows that share the same partition-column values in the same MapReduce task, so that each task creates as few new partition directories as possible, you can use DISTRIBUTE BY to route rows with identical partition-column values to the same place.
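
As a reusable template, the final insert above generalizes to the pattern below (table and column names are placeholders): DISTRIBUTE BY sends all rows sharing a combination of partition-key values to the same reducer, so each partition is written by a single task.

hive> from source_table s
    > insert overwrite table target_table partition (dt, province)
    > select s.id, s.name, s.dt, s.province
    > distribute by s.dt, s.province;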

Ref: http://blog.csdn.net/lpxuan151009/article/details/6653514

