YouTube Data Analysis Part-1

In this article, we will see how to analyze YouTube data using Big Data technology, walking through the project step by step.

YouTube is an American online video-sharing platform headquartered in San Bruno, California. Three former PayPal employees, Chad Hurley, Steve Chen, and Jawed Karim, created the service in February 2005.

Problem Statement: The goal is to

1) Find the top 5 categories with the largest number of uploaded videos.

2) Find the top 10 rated videos.

3) Find the top 10 most viewed videos.

4) Find the top 10 rated videos in each category.

5) Find the top 10 most viewed videos in each category.

Attribute Information or Dataset Details:

Data: Input Format – TAB Separated Values File (tsv file)

The public dataset is available at the following URL:
http://netsg.cs.sfu.ca/youtubedata/0222.zip
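
Before loading the data into Hadoop, it can help to confirm that the extracted files really are tab-separated and to see how many fields each record carries. The sketch below is only an illustration: the function names and the `youtube_raw` directory are hypothetical, and it assumes the archive unpacks to plain text files and that `wget` and `unzip` are installed.

```shell
#!/usr/bin/env bash
# Sketch: fetch the archive and summarise the number of tab-separated fields per line.

DATASET_URL="http://netsg.cs.sfu.ca/youtubedata/0222.zip"

# Download and unpack the archive into a target directory (needs network access).
# "youtube_raw" is a hypothetical default directory name.
fetch_dataset() {
    local dest="${1:-youtube_raw}"
    mkdir -p "$dest" &&
        wget -q -O "$dest/0222.zip" "$DATASET_URL" &&
        unzip -o -q "$dest/0222.zip" -d "$dest"
}

# Print "<field count><TAB><number of lines>" for one file, sorted by field count.
field_count_histogram() {
    awk -F'\t' '{ counts[NF]++ }
        END { for (n in counts) printf "%d\t%d\n", n, counts[n] }' "$1" |
        sort -n
}
```

Running `field_count_histogram` on one of the extracted files shows whether every record has the same number of fields; records with fewer fields than expected may need to be filtered out before analysis.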

Technology Used

  • Apache Hadoop
  • Apache Pig
  • Big Data
  • Linux
  • Shell Script

Processing Logic – YouTube Data Analysis in the Hadoop Ecosystem

The Apache Pig script will perform the following:

1) Find the top 5 categories with the largest number of uploaded videos.
2) Find the top 10 rated videos.
3) Find the top 10 most viewed videos.
4) Find the top 10 rated videos in each category.
5) Find the top 10 most viewed videos in each category.
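
To make the logic concrete, two of these queries can be cross-checked locally on a small sample with standard Unix tools. This is not the Pig script, only a sketch for sanity-checking results. It assumes the column layout is video ID, uploader, age, category, length, views, rate, and so on (please verify this against the dataset's own description), and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Local cross-check of two queries using awk and sort on a small TSV sample.
# ASSUMED column layout (verify against the dataset description):
#   1 video ID, 2 uploader, 3 age, 4 category, 5 length, 6 views, 7 rate, ...

# Top N categories by number of videos; prints "<count><TAB><category>".
# Records with fewer than 4 fields are skipped.
top_categories() {
    local file="$1" n="${2:-5}"
    awk -F'\t' 'NF >= 4 { count[$4]++ }
        END { for (c in count) printf "%d\t%s\n", count[c], c }' "$file" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2 |
        head -n "$n"
}

# Top N videos by view count; prints "<views><TAB><video ID>".
# Records with fewer than 6 fields are skipped.
top_viewed() {
    local file="$1" n="${2:-10}"
    awk -F'\t' 'NF >= 6 { printf "%d\t%s\n", $6, $1 }' "$file" |
        sort -t "$(printf '\t')" -k1,1nr -k2,2 |
        head -n "$n"
}
```

For example, `top_categories sample.txt 5` and `top_viewed sample.txt 10` give reference results that the Pig output for queries 1 and 3 should match on the same sample. The per-category queries follow the same pattern, with the ranking applied inside each category group.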

Apache Pig Code (Youtube_data_analysis.pig)


The shell script will perform the following:

The purpose of this shell script is to perform clean-up (delete existing output files), execute the Pig script, run the Hive commands that store the results in Hive tables, and save the results to a file in CSV format.
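
A minimal sketch of such a driver is shown here; it is not the original script. All paths, the result name, and the Hive table are hypothetical placeholders. It also assumes the Pig script reads an `OUTPUT_DIR` parameter and writes comma-delimited output, and that the Hive table already exists. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of a driver: clean up old output, run Pig, export a CSV, load into Hive.
# Every path and name below is a HYPOTHETICAL placeholder.

PIG_SCRIPT="${PIG_SCRIPT:-Youtube_data_analysis.pig}"
HDFS_OUTPUT_DIR="${HDFS_OUTPUT_DIR:-/user/hadoop/youtube/output}"  # assumed HDFS path
RESULT_NAME="${RESULT_NAME:-top5_categories}"  # assumed result dir and table name
LOCAL_CSV="${LOCAL_CSV:-youtube_results.csv}"  # assumed local file name
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

main() {
    local result_dir="$HDFS_OUTPUT_DIR/$RESULT_NAME"

    # 1) Clean-up: remove output from a previous run (-f: no error if missing).
    run hdfs dfs -rm -r -f "$HDFS_OUTPUT_DIR" || return 1
    run rm -f "$LOCAL_CSV" || return 1

    # 2) Run the Pig script, passing the output directory as a parameter.
    run pig -param "OUTPUT_DIR=$HDFS_OUTPUT_DIR" -f "$PIG_SCRIPT" || return 1

    # 3) Merge the part files into one local file. This runs before the Hive
    #    load because LOAD DATA INPATH moves the files out of result_dir.
    run hdfs dfs -getmerge "$result_dir" "$LOCAL_CSV" || return 1

    # 4) Load the result into an existing Hive table.
    run hive -e "LOAD DATA INPATH '$result_dir' OVERWRITE INTO TABLE $RESULT_NAME" || return 1
}
```

To use it as a standalone script, add `main "$@"` as the last line and run it with `DRY_RUN=0` once the printed commands look right.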

Shell Script Code (Youtube_data_analysis.sh)

By Bhavesh