In this article, We have explore generating Analytics from a Product based Company using Web Log. Even by using small data, We could still gain a lot of valuable insights.
Problem Statement: Generate Analytics based on the data in Hadoop Eco-system:
- Load weblog data into HDFS using HDFS client
- Develop Pig program to load log and perform analytics on IP Category-1 Category-2 page, status_code
2.1. Count of page views by individual user ie [IP, count(*)]
2.2. Top / Bottom 2: catagery-1/ catagery-2 / page /users (Exclude status code other than 200)
Top 2 and bottom 2 records
- Category, total_number_views
- page, total_number_views
- IP, total_number_of_views
2.3. Total page views / Category wise pageviews / Unique pageviews
- page,total_number_of_views
- category, total_views
- page, total_number_of_unique_views
2.4. Count of status code = 200 / 404 / 400 / 500
- status_code, count
- Load results into tables in MySql Database using Sqoop.
Attribute Information or Dataset Details:
Column Name | Data Type |
---|---|
IP | String |
Category1 | String |
Category2 | String |
page | Integer |
statuscode | Integer |
Data: Input Format – Text
Technology Used
- Apache Hadoop (HDFS)
- Apache Pig
- Apache Hive
- MySQL
- Shell Script
- Apache Sqoop
- Linux
Apache Pig Code
weblog.pig /* Develop Pig program to extract data for the following KPIs */ FILE1 = LOAD '/home/bhavesh/weblog.txt' using PigStorage(',') as (IP:chararray,Category1:chararray,Category2:chararray,page:int,statuscode:int); GRPD = group FILE1 by IP; IPCOUNT = foreach GRPD generate group,COUNT(FILE1.IP); /* Count of page views by individual user ie IP, count(*) */ STORE IPCOUNT into '/home/bhavesh/Count_by_Pageview_by_IndividualUser/' using PigStorage(','); /*(Exclude status code other than 200) */ FILE_STATUS_CODE_200 = FILTER FILE1 by statuscode == 200; /* Top 2 and bottom 2 records Category, total_number_views page, total_number_views IP, total_number_of_views */ GRPD_CATEGORY1 = group FILE_STATUS_CODE_200 by Category1; CNT_FOR_CATEGORY1 = FOREACH GRPD_CATEGORY1 generate group,COUNT(FILE_STATUS_CODE_200.Category1) as COUNTING; SORTED_CATEGORY1_DEC = ORDER CNT_FOR_CATEGORY1 by COUNTING DESC; TOP_2_CATEGORY1 = limit SORTED_CATEGORY1_DEC 2; STORE TOP_2_CATEGORY1 into '/home/bhavesh/Top2_By_Category1/' using PigStorage(','); SORTED_CATEGORY1_ASC = ORDER CNT_FOR_CATEGORY1 by COUNTING ASC; BOTTOM_2_CATEGORY1 = limit SORTED_CATEGORY1_ASC 2; STORE BOTTOM_2_CATEGORY1 into '/home/bhavesh/Bottom2_By_Category1/' using PigStorage(','); GRPD_CATEGORY2 = group FILE_STATUS_CODE_200 by Category2; CNT_FOR_CATEGORY2 = FOREACH GRPD_CATEGORY2 generate group,COUNT(FILE_STATUS_CODE_200.Category2) as COUNTING; SORTED_CATEGORY2_DEC = ORDER CNT_FOR_CATEGORY2 by COUNTING DESC; TOP_2_CATEGORY2 = limit SORTED_CATEGORY2_DEC 2; STORE TOP_2_CATEGORY2 into '/home/bhavesh/Top2_By_Category2/' using PigStorage(','); SORTED_CATEGORY2_ASC = ORDER CNT_FOR_CATEGORY2 by COUNTING ASC; BOTTOM_2_CATEGORY2 = limit SORTED_CATEGORY2_ASC 2; STORE BOTTOM_2_CATEGORY2 into '/home/bhavesh/Bottom2_By_Category2/' using PigStorage(','); GRPD_PAGES = group FILE_STATUS_CODE_200 by page; CNT_FOR_PAGE = FOREACH GRPD_PAGES generate group,COUNT(FILE_STATUS_CODE_200.page) as COUNTING; SORTED_PAGE_DEC = ORDER CNT_FOR_PAGE by COUNTING DESC; TOP_2_PAGE = limit SORTED_PAGE_DEC 2; STORE TOP_2_PAGE into '/home/bhavesh/Top2_By_PAGE/' using PigStorage(','); SORTED_PAGE_ASC = ORDER CNT_FOR_PAGE by COUNTING ASC; BOTTOM_2_PAGE = limit SORTED_PAGE_ASC 2; STORE BOTTOM_2_PAGE into '/home/bhavesh/Bottom2_By_PAGE/' using PigStorage(','); GRPD_IP = group FILE_STATUS_CODE_200 by IP; CNT_FOR_IP = FOREACH GRPD_IP generate group,COUNT(FILE_STATUS_CODE_200.IP) as COUNTING; SORTED_IP_DEC = ORDER CNT_FOR_IP by COUNTING DESC; TOP_2_IP = limit SORTED_IP_DEC 2; STORE TOP_2_IP into '/home/bhavesh/Top2_By_IP/' using PigStorage(','); SORTED_IP_ASC = ORDER CNT_FOR_IP by COUNTING ASC; BOTTOM_2_IP = limit SORTED_IP_ASC 2; STORE BOTTOM_2_IP into '/home/bhavesh/Bottom2_By_IP/' using PigStorage(','); /* Total page views / Category wise pageviews / Unique pageviews page,total_number_of_views category, total_views page, total_number_of_unique_views */ GRPD_TOTALPAGES = group FILE1 by page; CNT_FOR_TOTALPAGE = FOREACH GRPD_TOTALPAGES generate group,COUNT(FILE1.page) as COUNTING; SORTED_PAGES_DEC = ORDER CNT_FOR_TOTALPAGE by COUNTING DESC; STORE SORTED_PAGES_DEC into '/home/bhavesh/Total_PAGES_Count/' using PigStorage(','); GRPD_TOTALCATEGORY1 = group FILE1 by Category1; CNT_FOR_TOTALCATEGORY1 = FOREACH GRPD_TOTALCATEGORY1 generate group,COUNT(FILE1.page) as COUNTING; SORTED_TOTALCATEGORY1_DEC = ORDER CNT_FOR_TOTALCATEGORY1 by COUNTING DESC; STORE SORTED_TOTALCATEGORY1_DEC into '/home/bhavesh/Total_Category1_Count/' using PigStorage(','); GRPD_TOTALCATEGORY2 = group FILE1 by Category2; CNT_FOR_TOTALCATEGORY2 = FOREACH GRPD_TOTALCATEGORY2 generate group,COUNT(FILE1.page) as COUNTING; SORTED_TOTALCATEGORY2_DEC = ORDER CNT_FOR_TOTALCATEGORY2 by COUNTING DESC; STORE SORTED_TOTALCATEGORY2_DEC into '/home/bhavesh/Total_Category2_Count/' using PigStorage(','); GRPD_TOTALPAGES_UNIQUEVIEW = group FILE1 by page; CNT_FOR_TOTALPAGE_UNIQUEVIEW = FOREACH GRPD_TOTALPAGES_UNIQUEVIEW { internet_protocol = FILE1.IP; unique_internet_protocol = DISTINCT internet_protocol; GENERATE group, COUNT(unique_internet_protocol); }; STORE CNT_FOR_TOTALPAGE_UNIQUEVIEW into '/home/bhavesh/Page_Total_Number_Of_Unique_Views/' using PigStorage(','); /* Count of status code = 200 / 404 / 400 / 500 status_code, count */ GRPD = group FILE1 by statuscode; STATUS_CODE_COUNT = foreach GRPD generate group,COUNT(FILE1.statuscode); STORE STATUS_CODE_COUNT into '/home/bhavesh/Status_Code_Count/' using PigStorage(',');
Shell Script
####################################################################### ############################# COMPLETE SCRIPT ####################### ### HEADER - PROGRAM NAME - <weblog.sh> ### AUTHOR - BHAVESH BHADRICHA ### DATE - 11/DEC/2015 ### VERSION - 1.0 ### DESCRIPTION - Data: It comprises of the information gathered from websites ### which contains IP, Two Categories of Product, Pages and Status Code ### ### ### Problem Statement: Analyse the data in Hadoop Eco-system to: ### 1.Load data into HDFS using HDFS client ### ### ### 2. Develop PIG program to parse WEB logs and meaning full result from it ### INUPT file Format ### IP,Category-1,Category-2,page status_code ### ### ### PIG program to extract data for the following ### ### ### 3. Count of page views by individual user ### IP, count(*) ### ### ### ### 4. Top / Bottom 5: catagery-1/ catagery-2 / page /users ### (Exclude status code other than 200) ### ### ### Top 5 and bottom 5 records ### ### Category, total_number_views ### ### page, total_number_views ### ### IP, total_number_of_views ### ### ### ### 5. Total page views / Category wise pageviews / Unique pageviews ### ### ### ### page,total_number_of_views ### ### category, total_views ### ### page, total_number_of_unique_views ### ### ### ### 6. Count of status code = 200 / 404 / 400 / 500 ### ### status_code, count ### ### ### ### 7. Load results into tables in MySql Database using Sqoop.? ### ##################################################################### ##################################################################### ##################################################################### ###DEFINING THE LOCAL VARIABLES### ###################################################################### DATE=$(date +"%Y%m%d_%H%M%S") LOGFILE="/home/bhavesh/POC/WEBLOG_POC/LOG/"$DATE".log" ##################################################################### ### Load data into HDFS using HDFS client ######################### ##################################################################### hadoop fs -put weblog.txt /home/bhavesh/weblog.txt ########################## PIG Processing ############################### #### PIG, which splits the data into two parts: Category data and Ratings data ### ######################################################################### echo "Pig Script starts here" echo "PIG Script,Weblog Processing" >> $LOGFILE hadoop fs -rmr /home/bhavesh/Total_Category1_Count hadoop fs -rmr /home/bhavesh/Bottom2_By_PAGE hadoop fs -rmr /home/bhavesh/Top2_By_Category1 hadoop fs -rmr /home/bhavesh/Top2_By_IP hadoop fs -rmr /home/bhavesh/Total_PAGES_Count hadoop fs -rmr /home/bhavesh/Bottom2_By_IP hadoop fs -rmr /home/bhavesh/Bottom2_By_Category2 hadoop fs -rmr /home/bhavesh/Status_Code_Count hadoop fs -rmr /home/bhavesh/Count_by_Pageview_by_IndividualUser hadoop fs -rmr /home/bhavesh/Page_Total_Number_Of_Unique_Views hadoop fs -rmr /home/bhavesh/Total_Category2_Count hadoop fs -rmr /home/bhavesh/Bottom2_By_Category1 hadoop fs -rmr /home/bhavesh/Top2_By_Category2 hadoop fs -rmr /home/bhavesh/Top2_By_PAGE pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig if [ $? -eq 0 ]; then echo "Succesfully finished PIG Processing " >> $LOGFILE else echo "PIG Processing Failed Please check the Log " >> $LOGFILE fi ############################ HIVE Processing ############################### ###### HIVE will load the Category data and Rating Data into Hive Tables ########## ############################################################################# echo "HIVE Script starts here" echo "HIVE LOAD data into Table " >> $LOGFILE hive -e 'drop table if exists TotalCategory1Count'; hive -e 'drop table if exists Bottom2ByPAGE'; hive -e 'drop table if exists Top2ByCategory1'; hive -e 'drop table if exists Top2ByIP'; hive -e 'drop table if exists TotalPAGESCount'; hive -e 'drop table if exists Bottom2ByIP'; hive -e 'drop table if exists Bottom2ByCategory2'; hive -e 'drop table if exists StatusCodeCount'; hive -e 'drop table if exists CountbyPageviewbyIndividualUser'; hive -e 'drop table if exists PageTotalNumberOfUniqueViews'; hive -e 'drop table if exists TotalCategory2Count'; hive -e 'drop table if exists Bottom2ByCategory1'; hive -e 'drop table if exists Top2ByCategory2'; hive -e 'drop table if exists Top2ByPAGE'; hive -e "create external table TotalCategory1Count (Category1 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/TotalCategory1Count'"; hive -e "create external table Bottom2ByPAGE (Pages int, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Bottom2ByPAGE'"; hive -e "create external table Top2ByCategory1 (Category1 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Top2ByCategory1'"; hive -e "create external table Top2ByIP (IP string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Top2ByIP'"; hive -e "create external table TotalPAGESCount (PAGES int, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/TotalPAGESCount'"; hive -e "create external table Bottom2ByIP (IP string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Bottom2ByIP'"; hive -e "create external table Bottom2ByCategory2 (Category2 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory2'"; hive -e "create external table StatusCodeCount (StatusCode int, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/StatusCodeCount'"; hive -e "create external table CountbyPageviewbyIndividualUser (IP string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/CountbyPageviewbyIndividualUser'"; hive -e "create external table PageTotalNumberOfUniqueViews (page int, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/PageTotalNumberOfUniqueViews'"; hive -e "create external table TotalCategory2Count (Category2 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/TotalCategory2Count'"; hive -e "create external table Bottom2ByCategory1 (Category1 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory1'"; hive -e "create external table Top2ByCategory2 (Category2 string, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Top2ByCategory2'"; hive -e "create external table Top2ByPAGE (page int, countings int) row format delimited fields terminated by',' lines terminated by '\n' stored as textfile location '/home/bhavesh/hive/Top2ByPAGE'"; hive -e "load data inpath '/home/bhavesh/Total_Category1_Count/part-r-00000' overwrite into table TotalCategory1Count"; hive -e "load data inpath '/home/bhavesh/Bottom2_By_PAGE/part-r-00000' overwrite into table Bottom2ByPAGE"; hive -e "load data inpath '/home/bhavesh/Top2_By_Category1/part-r-00000' overwrite into table Top2ByCategory1"; hive -e "load data inpath '/home/bhavesh/Top2_By_IP/part-r-00000' overwrite into table Top2ByIP"; hive -e "load data inpath '/home/bhavesh/Total_PAGES_Count/part-r-00000' overwrite into table TotalPAGESCount"; hive -e "load data inpath '/home/bhavesh/Bottom2_By_IP/part-r-00000' overwrite into table Bottom2ByIP"; hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category2/part-r-00000' overwrite into table Bottom2ByCategory2"; hive -e "load data inpath '/home/bhavesh/Status_Code_Count/part-r-00000' overwrite into table StatusCodeCount"; hive -e "load data inpath '/home/bhavesh/Count_by_Pageview_by_IndividualUser/part-r-00000' overwrite into table CountbyPageviewbyIndividualUser"; hive -e "load data inpath '/home/bhavesh/Page_Total_Number_Of_Unique_Views/part-r-00000' overwrite into table PageTotalNumberOfUniqueViews"; hive -e "load data inpath '/home/bhavesh/Total_Category2_Count/part-r-00000' overwrite into table TotalCategory2Count"; hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category1/part-r-00000' overwrite into table Bottom2ByCategory1"; hive -e "load data inpath '/home/bhavesh/Top2_By_Category2/part-r-00000' overwrite into table Top2ByCategory2"; hive -e "load data inpath '/home/bhavesh/Top2_By_PAGE/part-r-00000' overwrite into table Top2ByPAGE"; ############################ SQOOP Processing ############################## ###### Pushing the HIVE Tale data into RDBMS Tables via SQOOP ####################### ######################################################################### sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory1Count --export-dir /home/bhavesh/hive/TotalCategory1Count/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByPAGE --export-dir /home/bhavesh/hive/Bottom2ByPAGE/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory1 --export-dir /home/bhavesh/hive/Top2ByCategory1/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByIP --export-dir /home/bhavesh/hive/Top2ByIP/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalPAGESCount --export-dir /home/bhavesh/hive/TotalPAGESCount/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByIP --export-dir /home/bhavesh/hive/Bottom2ByIP/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory2 --export-dir /home/bhavesh/hive/Bottom2ByCategory2/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table StatusCodeCount --export-dir /home/bhavesh/hive/StatusCodeCount/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table CountbyPageviewbyIndividualUser --export-dir /home/bhavesh/hive/CountbyPageviewbyIndividualUser/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table PageTotalNumberOfUniqueViews --export-dir /home/bhavesh/hive/PageTotalNumberOfUniqueViews/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory2Count --export-dir /home/bhavesh/hive/TotalCategory2Count/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory1 --export-dir /home/bhavesh/hive/Bottom2ByCategory1/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory2 --export-dir /home/bhavesh/hive/Top2ByCategory2/part-r-00000 --input-fields-terminated-by ','; sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByPAGE --export-dir /home/bhavesh/hive/Top2ByPAGE/part-r-00000 --input-fields-terminated-by ',';