Analyze social bookmarking sites to find insights Part 1

In this article, we analyze data from social bookmarking sites with Big Data technology to find insights. A bookmarking site lets users bookmark, review, rate, and search links on any topic. The data is in XML format and contains the categories that describe each link and the ratings associated with it.

Problem Statement: Analyze the data in the Hadoop ecosystem to:

  1. Load the data into the Hadoop Distributed File System (HDFS) and analyze it with MapReduce, Pig, and Hive to find the top-rated links based on user comments, likes, and similar signals.
  2. Use MapReduce to convert the semi-structured XML data into a structured format.
  3. Push the MapReduce output to HDFS and then feed it into Pig, which splits the data into two parts: category data and rating data.
  4. Write a Hive query to analyze the data further and push the output into a relational database (RDBMS) using Sqoop.
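
To make the goal of the steps above concrete, here is a minimal plain-Java sketch of the split into category data and rating data and of the top-rated lookup. It is only an illustration of the logic, not the Pig or Hive code used in this project. The column layout `link,category,rating` and the class name `BookmarkAnalysisSketch` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: assumes each flattened line looks like "link,category,rating". */
class BookmarkAnalysisSketch {

    /** Category data: link -> category, taken from columns 0 and 1. */
    static Map<String, String> categoryData(List<String> lines) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split(",", -1);
            out.put(cols[0], cols[1]);
        }
        return out;
    }

    /** Rating data: link -> average rating, taken from columns 0 and 2. */
    static Map<String, Double> ratingData(List<String> lines) {
        Map<String, double[]> sums = new LinkedHashMap<>(); // value holds {sum, count}
        for (String line : lines) {
            String[] cols = line.split(",", -1);
            double[] acc = sums.computeIfAbsent(cols[0], k -> new double[2]);
            acc[0] += Double.parseDouble(cols[2]);
            acc[1] += 1;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            out.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return out;
    }

    /** Links ordered by average rating, highest first, limited to n entries. */
    static List<String> topRated(Map<String, Double> ratings, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(ratings.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

In the actual pipeline this work is distributed: Pig performs the split and Hive performs the ranking, so the sketch only shows what each stage is expected to produce.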

Attribute Information or Dataset Details:

Data: Input Format – .XML

Technology Used

  1. Apache Hadoop (HDFS)
  2. MapReduce
  3. Apache Pig
  4. Apache Hive
  5. MySQL
  6. Shell Script
  7. Apache Sqoop
  8. Linux
  9. Java

MapReduce Code to Convert the XML File to a Flat (Comma-Separated) File

MyMapper.java
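
The core of a mapper like this is turning one XML record into one comma-separated line. The sketch below shows that step in plain Java, without the Hadoop `Mapper` wrapper, so it can be run on its own. It is not the original MyMapper.java: the class name `BookmarkRecordFlattener` and the tag names `link`, `category`, and `rating` are hypothetical, since the dataset's schema is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: flatten one XML record into a comma-separated line. */
class BookmarkRecordFlattener {

    // Hypothetical tag names; replace them with the tags of the real dataset.
    static final String[] FIELDS = {"link", "category", "rating"};

    /** Parses a single XML record and returns its fields joined by commas. */
    static String toCsvLine(String xmlRecord) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xmlRecord)));
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < FIELDS.length; i++) {
                NodeList nodes = doc.getElementsByTagName(FIELDS[i]);
                // A missing tag becomes an empty column so every line has the same width.
                String value = nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
                if (i > 0) {
                    line.append(',');
                }
                line.append(quoteIfNeeded(value));
            }
            return line.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML record", e);
        }
    }

    /** Wraps a value in double quotes when it contains a comma or a quote. */
    static String quoteIfNeeded(String value) {
        if (value.contains(",") || value.contains("\"")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    public static void main(String[] args) {
        String record = "<bookmark><link>http://example.com</link>"
                + "<category>news</category><rating>4</rating></bookmark>";
        System.out.println(toCsvLine(record));
    }
}
```

In a Hadoop job, the body of `toCsvLine` would typically sit inside the mapper's `map` method, with the XML record arriving as the input value and the flattened line written to the context.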

XMLDriver.java

XMLInputFormat.java
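
An XML input format usually works by scanning the input for a start tag and its matching end tag and handing everything in between to the mapper as one record. The sketch below shows that idea on an in-memory string. It is a simplified illustration, not the original XMLInputFormat.java: a real Hadoop input format reads a byte stream and has to handle records that cross split boundaries. The class name `XmlRecordSplitter` and the tag name `bookmark` used in the example are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: cut a block of XML text into records delimited by a given tag. */
class XmlRecordSplitter {

    /**
     * Returns every substring that starts with <tagName> and ends with </tagName>.
     * Start tags that carry attributes are not matched in this simplified version.
     */
    static List<String> splitRecords(String text, String tagName) {
        String startTag = "<" + tagName + ">";
        String endTag = "</" + tagName + ">";
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = text.indexOf(startTag, pos);
            if (start < 0) {
                break; // no further records
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // trailing record is incomplete, so it is skipped
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<bookmarks><bookmark><link>a</link></bookmark>"
                + "<bookmark><link>b</link></bookmark></bookmarks>";
        for (String record : splitRecords(xml, "bookmark")) {
            System.out.println(record);
        }
    }
}
```

Each string returned here corresponds to what the mapper would receive as a single input value.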

Apache Pig Script - bookmarkanalysis.pig

Shell Script

By Bhavesh