大數據Hive學習案例(1)——基於搜狗sogou_500w的日誌數據分析

2020-08-08 17:06:00

數據預處理

檢視數據

[hadoop@hadoop000 hive_data]$ less sogou.500w.utf8
20111230000005  57375476989eea12893c0c3811607bcf        奇藝高清        1       1       http://www.qiyi.com/
20111230000005  66c5bb7774e31d0a22278249b26bc83a        凡人修仙傳      3       1       http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007  b97920521c78de70ac38e3713f524b50        本本聯盟        1       1       http://www.bblianmeng.com/

[hadoop@hadoop000 hive_data]$ wc -l sogou.500w.utf8
5000000 sogou.500w.utf8

數據擴充套件

主要目的:將第一列的‘時間’進行substr操作,分成年,月,日,時這四列,加到數據的後面,方便後面進行分割區。

[hadoop@hadoop000 hive_data]$ vi sogou-log-extend.sh
#!/bin/bash
#infile=/sogou_500w.utf8
infile=$1
#outfile=/sogou_500w.utf8.ext
outfile=$2
awk -F '\t' '{print $0 "\t" substr($1,1,4) "\t" substr($1,5,2) "\t" substr($1,7,2) "\t" substr($1,9,2)}' $infile > $outfile

[hadoop@hadoop000 hive_data]$ bash sogou-log-extend.sh /home/hadoop/data/hive_data/sogou.500w.utf8 /home/hadoop/data/hive_data/sogou.500w.utf8.ext

[hadoop@hadoop000 hive_data]$ less sogou.500w.utf8.ext
20111230000005  57375476989eea12893c0c3811607bcf        奇藝高清        1       1       http://www.qiyi.com/    2011    12      30      00
20111230000005  66c5bb7774e31d0a22278249b26bc83a        凡人修仙傳      3       1       http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1   2011    12      30      00
20111230000007  b97920521c78de70ac38e3713f524b50        本本聯盟        1       1       http://www.bblianmeng.com/      2011    12      30      00

數據載入

將數據載入到hdfs

hadoop fs -mkdir -p /sogou/20111230
hadoop fs -put  /home/hadoop/data/hive_data/sogou.500w.utf8  /sogou/20111230/
hadoop fs -mkdir -p /sogou_ext/20111230
hadoop fs -put  /home/hadoop/data/hive_data/sogou.500w.utf8.ext  /sogou_ext/20111230/

構建數據倉庫

建立外部表

create external table if not exists sogou.sogou_20111230(
ts string,
uid string,
keyword string,
rank int,
order int,
url string)
comment 'This is the sogou search data of one day'
row format delimited
fields terminated by '\t'
stored as textfile
location '/sogou/20111230';

建立分割區表

create external table if not exists sogou.sogou_ext_20111230(
ts string,
uid string,
keyword string,
rank int,
order int,
url string,
year int,
month int,
day int,
hour int)
comment 'This is the sogou search data of extend'
row format delimited
fields terminated by '\t'
stored as textfile
location '/sogou_ext/20111230';

create external table if not exists sogou.sogou_partition(
ts string,
uid string,
keyword string,
rank int,
order int,
url string)
comment 'This is the sogou search data by partition'
partitioned by(
year INT, month INT, day INT, hour INT)
row format delimited
fields terminated by '\t'
stored as textfile;

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table sogou.sogou_partition partition(year, month, day, hour) select * from sogou.sogou_ext_20111230;

select * from sogou_partition limit 10;

數據分析需求

條數統計

*數據總數*
select count(*) from sogou_ext_20111230;
*非空查詢條數*
select count(*) from sogou_ext_20111230 where keyword is not null and keyword != ' ';
*無重複條數(根據ts,uid,keyword,url)*
select count(*) from (select ts,uid,keyword,url from sogou_ext_20111230 group by ts,uid,keyword,url  having count(*)=1) B;
*獨立UID條數*
select count(distinct(uid)) from sogou_ext_20111230;

關鍵詞分析

*關鍵詞平均長度統計,關鍵詞中沒有空白字元,則長度爲一*
select avg(a.cnt) from (select size(split(keyword,' s+')) as cnt from sogou_ext_20111230) a;

Total MapReduce CPU Time Spent: 19 seconds 870 msec
OK
1.0012018
Time taken: 28.049 seconds, Fetched: 1 row(s)

*查詢頻度排名*
select  keyword,count(*) as cnt from sogou_ext_20111230 group by keyword order by cnt desc limit 10;

Total MapReduce CPU Time Spent: 43 seconds 440 msec
OK
百度	38441
baidu	18312
人體藝術	14475
4399小遊戲	11438
qq空間	10317
優酷	10158
新亮劍	9654
館陶縣縣長閆寧的父親	9127
公安賣萌	8192
百度一下 你就知道	7505
Time taken: 64.503 seconds, Fetched: 10 row(s)

UID分析

*查詢一次,兩次,三次,大於三次的UID數量*
select sum(if(uids.cnt=1,1,0)),sum(if(uids.cnt=2,1,0)),sum(if(uids.cnt=3,1,0)),sum(if(uids.cnt>3,1,0)) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;

Total MapReduce CPU Time Spent: 34 seconds 600 msec
OK
549148	257163	149562	396791
Time taken: 56.256 seconds, Fetched: 1 row(s)

*UID平均查詢次數*
select sum(uids.cnt)/count(uids.uid) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;

Total MapReduce CPU Time Spent: 28 seconds 400 msec
OK
3.6964094557111005
Time taken: 49.467 seconds, Fetched: 1 row(s)

*查詢次數大於兩次的使用者總數*
select sum(if(uids.cnt>2,1,0)) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;

Total MapReduce CPU Time Spent: 31 seconds 520 msec
OK
546353
Time taken: 51.733 seconds, Fetched: 1 row(s)

*查詢次數大於兩次的使用者所佔比*
分開計算,然後相除。

使用者行爲分析

點選次數與rank之間的關係分析

select count(*) as cnt from sogou_ext_20111230 where rank<11;

Total MapReduce CPU Time Spent: 13 seconds 230 msec
OK
4999869
Time taken: 23.677 seconds, Fetched: 1 row(s)

總數爲500萬,比例爲4999869/5000000,可看出,絕大部分會點選前10條搜尋結果。

直接輸入URL作爲查詢詞的比例

*直接輸入URL查詢的比例*
select count(*) from sogou_ext_20111230 where keyword like '%www%';

Total MapReduce CPU Time Spent: 12 seconds 390 msec
OK
73979
Time taken: 24.717 seconds, Fetched: 1 row(s)

*直接輸入URL查詢並且查詢的URL位於點選的URL中*
select sum(if(instr(url,keyword)>0,1,0)) from (select * from sogou_ext_20111230 where keyword like '%www%') a;

Total MapReduce CPU Time Spent: 12 seconds 600 msec
OK
27561
Time taken: 23.817 seconds, Fetched: 1 row(s)

可看出大部分搜尋URL,並不能得到自己想要的結果。

獨立使用者行爲分析

搜尋個人的行爲,此處略。