In this article, we explore generating analytics for a product-based company from its web logs. Even with a small dataset, we can still extract a lot of valuable insight.

Problem Statement: Generate analytics from the data in the Hadoop ecosystem:
- Load the weblog data into HDFS using the HDFS client
- Develop a Pig program to load the log and perform analytics on IP, Category-1, Category-2, page, and status_code:
2.1. Count of page views by individual user, i.e. [IP, count(*)]
2.2. Top/bottom 2: Category-1 / Category-2 / page / users (exclude status codes other than 200)
Top 2 and bottom 2 records:
- Category, total_number_views
- page, total_number_views
- IP, total_number_of_views
2.3. Total page views / category-wise page views / unique page views
- page, total_number_of_views
- category, total_views
- page, total_number_of_unique_views
2.4. Count of status code = 200 / 404 / 400 / 500
- status_code, count
- Load the results into tables in a MySQL database using Sqoop

Attribute Information or Dataset Details:
| Column Name | Data Type |
|---|---|
| IP | String |
| Category1 | String |
| Category2 | String |
| page | Integer |
| statuscode | Integer |
Data: Input Format – Text
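To make the layout concrete, a few illustrative rows in this format might look like the following (these values are invented for illustration only; they are not taken from the actual dataset):

```
192.168.1.14,Clothing,Shirts,14,200
192.168.1.26,Electronics,Mobiles,32,404
192.168.1.14,Clothing,Jeans,17,200
```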

Technology Used
- Apache Hadoop (HDFS)
- Apache Pig
- Apache Hive
- MySQL
- Shell Script
- Apache Sqoop
- Linux
Apache Pig Code
weblog.pig
/* Develop Pig program to extract data for the following KPIs */
FILE1 = LOAD '/home/bhavesh/weblog.txt' using PigStorage(',') as (IP:chararray,Category1:chararray,Category2:chararray,page:int,statuscode:int);
GRPD = group FILE1 by IP;
IPCOUNT = foreach GRPD generate group,COUNT(FILE1.IP);
/* Count of page views by individual user, i.e. IP, count(*) */
STORE IPCOUNT into '/home/bhavesh/Count_by_Pageview_by_IndividualUser/' using PigStorage(',');
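Before running the job on the cluster, the per-IP count can be sanity-checked locally with awk, assuming you have a local copy of `weblog.txt` in the comma-delimited layout described above. This is just an illustrative check, not part of the pipeline:

```shell
# Count page views per IP (column 1) in the comma-delimited log,
# mirroring the GROUP BY IP + COUNT(*) logic of the Pig script.
awk -F',' '{ views[$1]++ } END { for (ip in views) print ip "," views[ip] }' weblog.txt
```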
/*(Exclude status code other than 200) */
FILE_STATUS_CODE_200 = FILTER FILE1 by statuscode == 200;
/*
Top 2 and bottom 2 records
Category, total_number_views
page, total_number_views
IP, total_number_of_views
*/
GRPD_CATEGORY1 = group FILE_STATUS_CODE_200 by Category1;
CNT_FOR_CATEGORY1 = FOREACH GRPD_CATEGORY1 generate group,COUNT(FILE_STATUS_CODE_200.Category1) as COUNTING;
SORTED_CATEGORY1_DEC = ORDER CNT_FOR_CATEGORY1 by COUNTING DESC;
TOP_2_CATEGORY1 = limit SORTED_CATEGORY1_DEC 2;
STORE TOP_2_CATEGORY1 into '/home/bhavesh/Top2_By_Category1/' using PigStorage(',');
SORTED_CATEGORY1_ASC = ORDER CNT_FOR_CATEGORY1 by COUNTING ASC;
BOTTOM_2_CATEGORY1 = limit SORTED_CATEGORY1_ASC 2;
STORE BOTTOM_2_CATEGORY1 into '/home/bhavesh/Bottom2_By_Category1/' using PigStorage(',');
GRPD_CATEGORY2 = group FILE_STATUS_CODE_200 by Category2;
CNT_FOR_CATEGORY2 = FOREACH GRPD_CATEGORY2 generate group,COUNT(FILE_STATUS_CODE_200.Category2) as COUNTING;
SORTED_CATEGORY2_DEC = ORDER CNT_FOR_CATEGORY2 by COUNTING DESC;
TOP_2_CATEGORY2 = limit SORTED_CATEGORY2_DEC 2;
STORE TOP_2_CATEGORY2 into '/home/bhavesh/Top2_By_Category2/' using PigStorage(',');
SORTED_CATEGORY2_ASC = ORDER CNT_FOR_CATEGORY2 by COUNTING ASC;
BOTTOM_2_CATEGORY2 = limit SORTED_CATEGORY2_ASC 2;
STORE BOTTOM_2_CATEGORY2 into '/home/bhavesh/Bottom2_By_Category2/' using PigStorage(',');
GRPD_PAGES = group FILE_STATUS_CODE_200 by page;
CNT_FOR_PAGE = FOREACH GRPD_PAGES generate group,COUNT(FILE_STATUS_CODE_200.page) as COUNTING;
SORTED_PAGE_DEC = ORDER CNT_FOR_PAGE by COUNTING DESC;
TOP_2_PAGE = limit SORTED_PAGE_DEC 2;
STORE TOP_2_PAGE into '/home/bhavesh/Top2_By_PAGE/' using PigStorage(',');
SORTED_PAGE_ASC = ORDER CNT_FOR_PAGE by COUNTING ASC;
BOTTOM_2_PAGE = limit SORTED_PAGE_ASC 2;
STORE BOTTOM_2_PAGE into '/home/bhavesh/Bottom2_By_PAGE/' using PigStorage(',');
GRPD_IP = group FILE_STATUS_CODE_200 by IP;
CNT_FOR_IP = FOREACH GRPD_IP generate group,COUNT(FILE_STATUS_CODE_200.IP) as COUNTING;
SORTED_IP_DEC = ORDER CNT_FOR_IP by COUNTING DESC;
TOP_2_IP = limit SORTED_IP_DEC 2;
STORE TOP_2_IP into '/home/bhavesh/Top2_By_IP/' using PigStorage(',');
SORTED_IP_ASC = ORDER CNT_FOR_IP by COUNTING ASC;
BOTTOM_2_IP = limit SORTED_IP_ASC 2;
STORE BOTTOM_2_IP into '/home/bhavesh/Bottom2_By_IP/' using PigStorage(',');
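The same top-2/bottom-2 logic can be spot-checked locally with standard Unix tools (again assuming a local copy of `weblog.txt` in the layout above). For example, for Category-1 counting only status-200 rows:

```shell
# Keep only status-200 rows, count views per Category-1 (column 2),
# then take the 2 largest and 2 smallest counts - mirroring the
# ORDER ... DESC/ASC + LIMIT 2 pattern in the Pig script.
awk -F',' '$5 == 200 { c[$2]++ } END { for (k in c) print k "," c[k] }' weblog.txt \
  | sort -t',' -k2,2nr | head -2   # top 2
awk -F',' '$5 == 200 { c[$2]++ } END { for (k in c) print k "," c[k] }' weblog.txt \
  | sort -t',' -k2,2n  | head -2   # bottom 2
```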
/* Total page views / Category wise pageviews / Unique pageviews
page,total_number_of_views
category, total_views
page, total_number_of_unique_views */
GRPD_TOTALPAGES = group FILE1 by page;
CNT_FOR_TOTALPAGE = FOREACH GRPD_TOTALPAGES generate group,COUNT(FILE1.page) as COUNTING;
SORTED_PAGES_DEC = ORDER CNT_FOR_TOTALPAGE by COUNTING DESC;
STORE SORTED_PAGES_DEC into '/home/bhavesh/Total_PAGES_Count/' using PigStorage(',');
GRPD_TOTALCATEGORY1 = group FILE1 by Category1;
CNT_FOR_TOTALCATEGORY1 = FOREACH GRPD_TOTALCATEGORY1 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY1_DEC = ORDER CNT_FOR_TOTALCATEGORY1 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY1_DEC into '/home/bhavesh/Total_Category1_Count/' using PigStorage(',');
GRPD_TOTALCATEGORY2 = group FILE1 by Category2;
CNT_FOR_TOTALCATEGORY2 = FOREACH GRPD_TOTALCATEGORY2 generate group,COUNT(FILE1.page) as COUNTING;
SORTED_TOTALCATEGORY2_DEC = ORDER CNT_FOR_TOTALCATEGORY2 by COUNTING DESC;
STORE SORTED_TOTALCATEGORY2_DEC into '/home/bhavesh/Total_Category2_Count/' using PigStorage(',');
GRPD_TOTALPAGES_UNIQUEVIEW = group FILE1 by page;
CNT_FOR_TOTALPAGE_UNIQUEVIEW = FOREACH GRPD_TOTALPAGES_UNIQUEVIEW {
    internet_protocol = FILE1.IP;
    unique_internet_protocol = DISTINCT internet_protocol;
    GENERATE group, COUNT(unique_internet_protocol);
};
STORE CNT_FOR_TOTALPAGE_UNIQUEVIEW into '/home/bhavesh/Page_Total_Number_Of_Unique_Views/' using PigStorage(',');
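The nested DISTINCT above counts, for each page, the number of distinct IPs that viewed it. The equivalent local check on the assumed CSV layout (page in column 4, IP in column 1):

```shell
# Unique page views: count each (page, IP) pair only once,
# then count the distinct IPs per page.
awk -F',' '!seen[$4 SUBSEP $1]++ { uniq[$4]++ }
           END { for (p in uniq) print p "," uniq[p] }' weblog.txt
```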
/*
Count of status code = 200 / 404 / 400 / 500
status_code, count
*/
GRPD_STATUS = group FILE1 by statuscode;
STATUS_CODE_COUNT = foreach GRPD_STATUS generate group, COUNT(FILE1.statuscode);
STORE STATUS_CODE_COUNT into '/home/bhavesh/Status_Code_Count/' using PigStorage(',');

Shell Script
#######################################################################
############################# COMPLETE SCRIPT #######################
### HEADER - PROGRAM NAME - <weblog.sh>
### AUTHOR - BHAVESH BHADRICHA
### DATE - 11/DEC/2015
### VERSION - 1.0
### DESCRIPTION - Data: comprises information gathered from websites,
### containing IP, two categories of product, page and status code
### ###
### Problem Statement: Analyse the data in the Hadoop Eco-system to:
### 1. Load data into HDFS using the HDFS client
### ###
### 2. Develop a PIG program to parse web logs and derive meaningful results from them
### INPUT file format:
### IP, Category-1, Category-2, page, status_code
### ###
### PIG program to extract data for the following
### ###
### 3. Count of page views by individual user
### IP, count(*) ###
### ###
### 4. Top / Bottom 2: Category-1 / Category-2 / page / users
### (Exclude status codes other than 200)
### ###
### Top 2 and bottom 2 records ###
### Category, total_number_views ###
### page, total_number_views ###
### IP, total_number_of_views ###
### ###
### 5. Total page views / Category wise pageviews / Unique pageviews ###
### ###
### page,total_number_of_views ###
### category, total_views ###
### page, total_number_of_unique_views ###
### ###
### 6. Count of status code = 200 / 404 / 400 / 500 ###
### status_code, count ###
### ###
### 7. Load results into tables in a MySql Database using Sqoop. ###
#####################################################################
#####################################################################
#####################################################################
###DEFINING THE LOCAL VARIABLES###
######################################################################
DATE=$(date +"%Y%m%d_%H%M%S")
LOGFILE="/home/bhavesh/POC/WEBLOG_POC/LOG/"$DATE".log"
#####################################################################
### Load data into HDFS using HDFS client #########################
#####################################################################
hadoop fs -put weblog.txt /home/bhavesh/weblog.txt
########################## PIG Processing ###############################
#### Pig runs the weblog analytics and writes each KPI result to its own HDFS directory ###
#########################################################################
echo "Pig Script starts here"
echo "PIG Script,Weblog Processing" >> $LOGFILE
hadoop fs -rmr /home/bhavesh/Total_Category1_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_PAGE
hadoop fs -rmr /home/bhavesh/Top2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_IP
hadoop fs -rmr /home/bhavesh/Total_PAGES_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_IP
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category2
hadoop fs -rmr /home/bhavesh/Status_Code_Count
hadoop fs -rmr /home/bhavesh/Count_by_Pageview_by_IndividualUser
hadoop fs -rmr /home/bhavesh/Page_Total_Number_Of_Unique_Views
hadoop fs -rmr /home/bhavesh/Total_Category2_Count
hadoop fs -rmr /home/bhavesh/Bottom2_By_Category1
hadoop fs -rmr /home/bhavesh/Top2_By_Category2
hadoop fs -rmr /home/bhavesh/Top2_By_PAGE
pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig
if [ $? -eq 0 ]; then
echo "Successfully finished PIG Processing " >> $LOGFILE
else
echo "PIG Processing Failed Please check the Log " >> $LOGFILE
fi
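Each stage repeats the same run-then-log pattern. A small helper function (a sketch, not part of the original script) keeps the logging consistent and stops the pipeline on failure so later stages never run on bad data:

```shell
# Run a command, log success/failure to $LOGFILE, and abort on failure.
run_step() {
  desc="$1"; shift
  if "$@" >> "$LOGFILE" 2>&1; then
    echo "SUCCESS: $desc" >> "$LOGFILE"
  else
    echo "FAILED: $desc - check $LOGFILE" >> "$LOGFILE"
    exit 1
  fi
}
# Example: run_step "PIG weblog processing" pig /home/bhavesh/POC/WEBLOG_POC/weblog.pig
```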
############################ HIVE Processing ###############################
###### Hive loads each Pig output into its own external Hive table ##########
#############################################################################
echo "HIVE Script starts here"
echo "HIVE LOAD data into Table " >> $LOGFILE
hive -e 'drop table if exists TotalCategory1Count';
hive -e 'drop table if exists Bottom2ByPAGE';
hive -e 'drop table if exists Top2ByCategory1';
hive -e 'drop table if exists Top2ByIP';
hive -e 'drop table if exists TotalPAGESCount';
hive -e 'drop table if exists Bottom2ByIP';
hive -e 'drop table if exists Bottom2ByCategory2';
hive -e 'drop table if exists StatusCodeCount';
hive -e 'drop table if exists CountbyPageviewbyIndividualUser';
hive -e 'drop table if exists PageTotalNumberOfUniqueViews';
hive -e 'drop table if exists TotalCategory2Count';
hive -e 'drop table if exists Bottom2ByCategory1';
hive -e 'drop table if exists Top2ByCategory2';
hive -e 'drop table if exists Top2ByPAGE';
hive -e "create external table TotalCategory1Count
(Category1 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory1Count'";
hive -e "create external table Bottom2ByPAGE
(Pages int,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByPAGE'";
hive -e "create external table Top2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory1'";
hive -e "create external table Top2ByIP
(IP string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByIP'";
hive -e "create external table TotalPAGESCount
(PAGES int,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalPAGESCount'";
hive -e "create external table Bottom2ByIP
(IP string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByIP'";
hive -e "create external table Bottom2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory2'";
hive -e "create external table StatusCodeCount
(StatusCode int,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/StatusCodeCount'";
hive -e "create external table CountbyPageviewbyIndividualUser
(IP string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/CountbyPageviewbyIndividualUser'";
hive -e "create external table PageTotalNumberOfUniqueViews
(page int,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/PageTotalNumberOfUniqueViews'";
hive -e "create external table TotalCategory2Count
(Category2 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/TotalCategory2Count'";
hive -e "create external table Bottom2ByCategory1
(Category1 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Bottom2ByCategory1'";
hive -e "create external table Top2ByCategory2
(Category2 string,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByCategory2'";
hive -e "create external table Top2ByPAGE
(page int,
countings int)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile location '/home/bhavesh/hive/Top2ByPAGE'";
hive -e "load data inpath '/home/bhavesh/Total_Category1_Count/part-r-00000' overwrite into table TotalCategory1Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_PAGE/part-r-00000' overwrite into table Bottom2ByPAGE";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category1/part-r-00000' overwrite into table Top2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_IP/part-r-00000' overwrite into table Top2ByIP";
hive -e "load data inpath '/home/bhavesh/Total_PAGES_Count/part-r-00000' overwrite into table TotalPAGESCount";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_IP/part-r-00000' overwrite into table Bottom2ByIP";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category2/part-r-00000' overwrite into table Bottom2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Status_Code_Count/part-r-00000' overwrite into table StatusCodeCount";
hive -e "load data inpath '/home/bhavesh/Count_by_Pageview_by_IndividualUser/part-r-00000' overwrite into table CountbyPageviewbyIndividualUser";
hive -e "load data inpath '/home/bhavesh/Page_Total_Number_Of_Unique_Views/part-r-00000' overwrite into table PageTotalNumberOfUniqueViews";
hive -e "load data inpath '/home/bhavesh/Total_Category2_Count/part-r-00000' overwrite into table TotalCategory2Count";
hive -e "load data inpath '/home/bhavesh/Bottom2_By_Category1/part-r-00000' overwrite into table Bottom2ByCategory1";
hive -e "load data inpath '/home/bhavesh/Top2_By_Category2/part-r-00000' overwrite into table Top2ByCategory2";
hive -e "load data inpath '/home/bhavesh/Top2_By_PAGE/part-r-00000' overwrite into table Top2ByPAGE";
############################ SQOOP Processing ##############################
###### Pushing the HIVE Table data into RDBMS Tables via SQOOP #######################
#########################################################################
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory1Count --export-dir /home/bhavesh/hive/TotalCategory1Count/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByPAGE --export-dir /home/bhavesh/hive/Bottom2ByPAGE/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory1 --export-dir /home/bhavesh/hive/Top2ByCategory1/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByIP --export-dir /home/bhavesh/hive/Top2ByIP/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalPAGESCount --export-dir /home/bhavesh/hive/TotalPAGESCount/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByIP --export-dir /home/bhavesh/hive/Bottom2ByIP/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory2 --export-dir /home/bhavesh/hive/Bottom2ByCategory2/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table StatusCodeCount --export-dir /home/bhavesh/hive/StatusCodeCount/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table CountbyPageviewbyIndividualUser --export-dir /home/bhavesh/hive/CountbyPageviewbyIndividualUser/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table PageTotalNumberOfUniqueViews --export-dir /home/bhavesh/hive/PageTotalNumberOfUniqueViews/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table TotalCategory2Count --export-dir /home/bhavesh/hive/TotalCategory2Count/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Bottom2ByCategory1 --export-dir /home/bhavesh/hive/Bottom2ByCategory1/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByCategory2 --export-dir /home/bhavesh/hive/Top2ByCategory2/part-r-00000 --input-fields-terminated-by ',';
sqoop export --connect jdbc:mysql://localhost/mysql --username root --password root --table Top2ByPAGE --export-dir /home/bhavesh/hive/Top2ByPAGE/part-r-00000 --input-fields-terminated-by ',';
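Note that `sqoop export` does not create the target tables; they must already exist in MySQL with a column layout matching the exported files. A sketch of the DDL for two of the tables, with names and types inferred from the Hive schemas above (the VARCHAR width is an assumption):

```sql
-- Target tables must exist before sqoop export runs.
CREATE TABLE TotalCategory1Count (
  Category1 VARCHAR(100),
  countings INT
);

CREATE TABLE StatusCodeCount (
  StatusCode INT,
  countings INT
);
```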
