Generate Analytics from a Product based Company Web Log Part 1

In this article, We have explore generating Analytics from a Product based Company using Web Log. Even by using small data, We could still gain a lot of valuable insights.

Problem Statement: Generate Analytics based on the data in Hadoop Eco-system:

  1. Load weblog data into HDFS using HDFS client
  2.      Develop Pig program to load log and perform analytics on  IP Category-1 Category-2 page, status_code

2.1.   Count of page views by individual user ie [IP, count(*)]

2.2.  Top / Bottom 2: catagery-1/ catagery-2 / page /users (Exclude status code other than  200)

Top 2 and bottom 2 records

  • Category, total_number_views
  • page, total_number_views
  • IP, total_number_of_views

2.3. Total page views / Category wise pageviews / Unique pageviews

  • page,total_number_of_views
  • category, total_views
  • page, total_number_of_unique_views

2.4. Count of status code = 200 / 404 / 400 / 500

  • status_code, count
  1. Load results into tables in MySql Database using Sqoop.

Attribute Information or Dataset Details:

Column NameData Type
IPString
Category1String
Category2String
pageInteger
statuscodeInteger

Data: Input Format – Text

Technology Used

  1. Apache Hadoop (HDFS)
  2. Apache Pig
  3. Apache Hive
  4. MySQL
  5. Shell Script
  6. Apache Sqoop
  7. Linux

Apache Pig Code

weblog.pig

/* Develop Pig program to extract data for the following KPIs */

FILE1 = LOAD '/home/bhavesh/weblog.txt' using PigStorage(',') as (IP:chararray,Category1:chararray,Category2:chararray,page:int,statuscode:int);
GRPD = group FILE1 by IP;  
IPCOUNT = foreach GRPD generate group,COUNT(FILE1.IP);

/* Count of page views by individual user ie IP, count(*) */
STORE IPCOUNT into '/home/bhavesh/Count_by_Pageview_by_IndividualUser/' using PigStorage(',');

/*(Exclude status code other than 200) */

FILE_STATUS_CODE_200 = FILTER FILE1 by statuscode == 200;

/*
Top 2 and bottom 2 records
Category, total_number_views
page, total_number_views
IP, total_number_of_views
*/

GRPD_CATEGORY1 = group FILE_STATUS_CODE_200 by Category1;
CNT_FOR_CATEGORY1 = FOREACH GRPD_CATEGORY1 generate group,COUNT(FILE_STATUS_CODE_200.Category1) as COUNTING;

SORTED_CATEGORY1_DEC = ORDER CNT_FOR_CATEGORY1 by COUNTING DESC;
TOP_2_CATEGORY1 = limit SORTED_CATEGORY1_DEC 2;
STORE TOP_2_CATEGORY1 into '/home/bhavesh/Top2_By_Category1/' using PigStorage(',');

SORTED_CATEGORY1_ASC = ORDER CNT_FOR_CATEGORY1 by COUNTING ASC;
BOTTOM_2_CATEGORY1 = limit SORTED_CATEGORY1_ASC 2;
STORE BOTTOM_2_CATEGORY1 into '/home/bhavesh/Bottom2_By_Category1/' using PigStorage(',');

GRPD_CATEGORY2 = group FILE_STATUS_CODE_200 by Category2;
CNT_FOR_CATEGORY2 = FOREACH GRPD_CATEGORY2 generate group,COUNT(FILE_STATUS_CODE_200.Category2) as COUNTING;

SORTED_CATEGORY2_DEC = ORDER CNT_FOR_CATEGORY2 by COUNTING DESC;
TOP_2_CATEGORY2 = limit SORTED_CATEGORY2_DEC 2;
STORE TOP_2_CATEGORY2 into '/home/bhavesh/Top2_By_Category2/' using PigStorage(',');

SORTED_CATEGORY2_ASC = ORDER CNT_FOR_CATEGORY2 by COUNTING ASC;
BOTTOM_2_CATEGORY2 = limit SORTED_CATEGORY2_ASC 2;
STORE BOTTOM_2_CATEGORY2 into '/home/bhavesh/Bottom2_By_Category2/' using PigStorage(',');

GRPD_PAGES = group FILE_STATUS_CODE_200 by page;
CNT_FOR_PAGE = FOREACH GRPD_PAGES generate group,COUNT(FILE_STATUS_CODE_200.page) as COUNTING;

SORTED_PAGE_DEC = ORDER CNT_FOR_PAGE by COUNTING DESC;
TOP_2_PAGE = limit SORTED_PAGE_DEC 2;
STORE TOP_2_PAGE into '/home/bhavesh/Top2_By_PAGE/' using PigStorage(',');

SORTED_PAGE_ASC = ORDER CNT_FOR_PAGE by COUNTING ASC;
BOTTOM_2_PAGE = limit SORTED_PAGE_ASC 2;
STORE BOTTOM_2_PAGE into '/home/bhavesh/Bottom2_By_PAGE/' using PigStorage(',');

GRPD_IP = group FILE_STATUS_CODE_200 by IP;
CNT_FOR_IP = FOREACH GRPD_IP generate group,COUNT(FILE_STATUS_CODE_200.IP) as COUNTING;

SORTED_IP_DEC = ORDER CNT_FOR_IP by COUNTING DESC;
TOP_2_IP = limit SORTED_IP_DEC 2;
STORE TOP_2_IP into '/home/bhavesh/Top2_By_IP/' using PigStorage(',');

SORTED_IP_ASC = ORDER CNT_FOR_IP by COUNTING ASC;
BOTTOM_2_IP = limit SORTED_IP_ASC 2;
STORE BOTTOM_2_IP into '/home/bhavesh/Bottom2_By_IP/' using PigStorage(',');

/* Total page views / Category wise pageviews / Unique pageviews
page,total_number_of_views
category, total_views
page, total_number_of_unique_views */

GRPD_TOTALPAGES = group FILE1 by page;
CNT_FOR_TOTALPAGE = FOREACH GRPD_TOTALPAGES generate group,COUNT(FILE1.page) as COUNTING;
SORTED_PAGES_DEC = ORDER CNT_FOR_TOTALPAGE by COUNTING DESC;
STORE SORTED_PAGES_DEC into '/home/bhavesh/Total_PAGES_Count/' using PigStorage(',');

GRPD_TOTALCATEGORY1 = group FILE1 by Category1;
CNT_FOR_TOTALCATEGORY1 = FOREACH GRPD_TOTALCATEGORY1 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY1_DEC = ORDER CNT_FOR_TOTALCATEGORY1 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY1_DEC into '/home/bhavesh/Total_Category1_Count/' using PigStorage(',');

GRPD_TOTALCATEGORY2 = group FILE1 by Category2;
CNT_FOR_TOTALCATEGORY2 = FOREACH GRPD_TOTALCATEGORY2 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY2_DEC = ORDER CNT_FOR_TOTALCATEGORY2 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY2_DEC into '/home/bhavesh/Total_Category2_Count/' using PigStorage(',');

GRPD_TOTALPAGES_UNIQUEVIEW = group FILE1 by page;
CNT_FOR_TOTALPAGE_UNIQUEVIEW = FOREACH GRPD_TOTALPAGES_UNIQUEVIEW {
                                                                internet_protocol = FILE1.IP;
                                                                unique_internet_protocol = DISTINCT internet_protocol;
                                                                GENERATE group, COUNT(unique_internet_protocol);
                                                                };
STORE CNT_FOR_TOTALPAGE_UNIQUEVIEW into '/home/bhavesh/Page_Total_Number_Of_Unique_Views/' using PigStorage(',');

/*
Count of status code = 200 / 404 / 400 / 500
status_code, count
*/

GRPD = group FILE1 by statuscode;  
STATUS_CODE_COUNT = foreach GRPD generate group,COUNT(FILE1.statuscode);

STORE STATUS_CODE_COUNT into '/home/bhavesh/Status_Code_Count/' using PigStorage(',');

Shell Script

#######################################################################
#############################  COMPLETE SCRIPT  #######################
### HEADER - PROGRAM NAME - <weblog.sh>                                                                                      
### AUTHOR - BHAVESH BHADRICHA                                                                                               
### DATE  - 11/DEC/2015                                                                                                      
### VERSION - 1.0                                                                                                            
### DESCRIPTION - Data: It comprises of the information gathered from websites               
### which contains IP, Two Categories of Product, Pages and Status Code                      
###                                                                                                                                                                                           ###
### Problem Statement: Analyse the data in Hadoop Eco-system to:                             
### 1.Load data into HDFS using HDFS client                                                  
###                                                                                                                                                                                           ###
### 2. Develop PIG program to parse WEB logs and meaning full result from it                 
###    INUPT file Format                                                                     
###    IP,Category-1,Category-2,page status_code                                             
###                                                                                                                                                                                           ###
### PIG program to extract data for the following                                            
###                                                                                                                                                                                            ###
### 3. Count of page views by individual user                                                
###    IP, count(*)                                                                                                                                                              ###
###                                                                                                                                                                                            ###
### 4. Top / Bottom 5: catagery-1/ catagery-2 / page /users                                                      
###    (Exclude status code other than 200)                                                                      
###                                                                                                                                                                                            ###
###    Top 5 and bottom 5 records                                                                                                                             ###
###    Category, total_number_views                                                                                                                     ###
###    page, total_number_views                                                                                                                             ###
###    IP, total_number_of_views                                                                                                                            ###
###                                                                                                                                                                                            ###
### 5. Total page views / Category wise pageviews / Unique pageviews                                            ###
###                                                                                                                                                                                            ###
###     page,total_number_of_views                                                                                                                      ###
###                                               category, total_views                                                                                               ###
###                                               page, total_number_of_unique_views                                                        ###
###                                                                                                                                                                                           ###
### 6. Count of status code = 200 / 404 / 400 / 500                                                                                    ###
###    status_code, count                                                                                                                                               ###
###                                                                                                                                                                                            ###
### 7. Load results into tables in MySql Database using Sqoop.?                                                          ###
#####################################################################
#####################################################################
#####################################################################
###DEFINING THE LOCAL VARIABLES###
######################################################################
DATE=$(date +"%%Y%%m%%d_%%H%%M%%S")
LOGFILE="/home/bhavesh/POC/WEBLOG_POC/LOG/"$DATE".log"
#####################################################################
###  Load data into HDFS using HDFS client  #########################
#####################################################################

hadoop fs -put weblog.txt /home/bhavesh/weblog.txt

########################## PIG Processing ###############################
#### PIG, which splits the data into two parts: Category data and Ratings data ###
#########################################################################

echo "Pig Script starts here"

echo "PIG Script,Weblog Processing" >> $LOGFILE

hadoop fs -rmr /home/bhavesh/Total_Category1_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_PAGE
hadoop fs -rmr /home/bhavesh/Top2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_IP
hadoop fs -rmr /home/bhavesh/Total_PAGES_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_IP
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category2
hadoop fs -rmr /home/bhavesh/Status_Code_Count
hadoop fs -rmr /home/bhavesh/Count_by_Pageview_by_IndividualUser
hadoop fs -rmr /home/bhavesh/Page_Total_Number_Of_Unique_Views
hadoop fs -rmr /home/bhavesh/Total_Category2_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_Category2
hadoop fs -rmr /home/bhavesh/Top2_By_PAGE

pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig

if [ $? -eq 0 ]; then
    echo "Succesfully finished PIG  Processing " >> $LOGFILE
else
    echo "PIG Processing Failed Please check the Log " >> $LOGFILE
fi


############################ HIVE Processing ###############################
###### HIVE will load the Category data and Rating Data into Hive Tables  ##########
#############################################################################

echo "HIVE Script starts here"

echo "HIVE LOAD data into Table " >> $LOGFILE

hive -e 'drop table if exists TotalCategory1Count';
hive -e 'drop table if exists Bottom2ByPAGE';
hive -e 'drop table if exists Top2ByCategory1';
hive -e 'drop table if exists Top2ByIP';
hive -e 'drop table if exists TotalPAGESCount';
hive -e 'drop table if exists Bottom2ByIP';
hive -e 'drop table if exists Bottom2ByCategory2';
hive -e 'drop table if exists StatusCodeCount';
hive -e 'drop table if exists CountbyPageviewbyIndividualUser';
hive -e 'drop table if exists PageTotalNumberOfUniqueViews';
hive -e 'drop table if exists TotalCategory2Count';
hive -e 'drop table if exists Bottom2ByCategory1';
hive -e 'drop table if exists Top2ByCategory2';
hive -e 'drop table if exists Top2ByPAGE';

hive -e "create external table TotalCategory1Count
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory1Count'";

hive -e "create external table Bottom2ByPAGE
(Pages int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByPAGE'";

hive -e "create external table Top2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory1'";


hive -e "create external table Top2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByIP'";

hive -e "create external table TotalPAGESCount
(PAGES int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalPAGESCount'";

hive -e "create external table Bottom2ByIP
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByIP'";

hive -e "create external table Bottom2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory2'";

hive -e "create external table StatusCodeCount
(StatusCode int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/StatusCodeCount'";

hive -e "create external table CountbyPageviewbyIndividualUser
(IP string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/CountbyPageviewbyIndividualUser'";

hive -e "create external table PageTotalNumberOfUniqueViews
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/PageTotalNumberOfUniqueViews'";

hive -e "create external table TotalCategory2Count
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory2Count'";

hive -e "create external table Bottom2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory1'";

hive -e "create external table Top2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory2'";

hive -e "create external table Top2ByPAGE
(page int,
countings int)
row format delimited
fields terminated by','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByPAGE'";

hive -e "load data inpath '/home/bhavesh/Total_Category1_Count/part-r-00000' overwrite into table TotalCategory1Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_PAGE/part-r-00000' overwrite into table Bottom2ByPAGE";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category1/part-r-00000' overwrite into table Top2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_IP/part-r-00000' overwrite into table Top2ByIP";
hive -e "load data inpath '/home/bhavesh/Total_PAGES_Count/part-r-00000' overwrite into table TotalPAGESCount";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_IP/part-r-00000' overwrite into table Bottom2ByIP";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category2/part-r-00000' overwrite into table Bottom2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Status_Code_Count/part-r-00000' overwrite into table StatusCodeCount";
hive -e "load data inpath '/home/bhavesh/Count_by_Pageview_by_IndividualUser/part-r-00000' overwrite into table CountbyPageviewbyIndividualUser";
hive -e "load data inpath '/home/bhavesh/Page_Total_Number_Of_Unique_Views/part-r-00000' overwrite into table PageTotalNumberOfUniqueViews";
hive -e "load data inpath '/home/bhavesh/Total_Category2_Count/part-r-00000' overwrite into table TotalCategory2Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category1/part-r-00000' overwrite into table Bottom2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category2/part-r-00000' overwrite into table Top2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Top2_By_PAGE/part-r-00000' overwrite into table Top2ByPAGE";


############################ SQOOP Processing ##############################
###### Pushing the HIVE Tale data into RDBMS Tables via SQOOP #######################
#########################################################################

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory1Count --export-dir /home/bhavesh/hive/TotalCategory1Count/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByPAGE --export-dir /home/bhavesh/hive/Bottom2ByPAGE/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory1 --export-dir /home/bhavesh/hive/Top2ByCategory1/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByIP --export-dir /home/bhavesh/hive/Top2ByIP/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalPAGESCount --export-dir /home/bhavesh/hive/TotalPAGESCount/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByIP --export-dir /home/bhavesh/hive/Bottom2ByIP/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory2 --export-dir /home/bhavesh/hive/Bottom2ByCategory2/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table StatusCodeCount --export-dir /home/bhavesh/hive/StatusCodeCount/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table CountbyPageviewbyIndividualUser --export-dir /home/bhavesh/hive/CountbyPageviewbyIndividualUser/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table PageTotalNumberOfUniqueViews --export-dir /home/bhavesh/hive/PageTotalNumberOfUniqueViews/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory2Count --export-dir /home/bhavesh/hive/TotalCategory2Count/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory1 --export-dir /home/bhavesh/hive/Bottom2ByCategory1/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory2 --export-dir /home/bhavesh/hive/Top2ByCategory2/part-r-00000 --input-fields-terminated-by ',';

sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByPAGE --export-dir /home/bhavesh/hive/Top2ByPAGE/part-r-00000 --input-fields-terminated-by ',';
By Bhavesh