Sensex Log Data Processing (PDF File Processing in Map Reduce) Part 1

In this article, we will see how to process a Sensex log (share-market data) in PDF format using big-data technology, walking through the execution of the project step by step.

Problem Statement: Analyse the data in the Hadoop ecosystem to:

  1. Load the complete PDF input data onto HDFS
  2. Develop a MapReduce use case to get the below filtered results from the HDFS input data

  If TYPE OF TRADING is 'SIP':

       – OPEN_BALANCE > 25000 and FLTUATION_RATE > 10 –> store in "HighDemandMarket"

       – CLOSING_BALANCE < 22000 and FLTUATION_RATE between 20 and 30 –> store in "OnGoingMarketStretegy"

  If TYPE OF TRADING is 'SHORTTERM':

       – OPEN_BALANCE < 5000 –> store in "WealthyProducts"

       – SensexLoc is "NewYork" or "Mumbai" –> store in "ReliableProducts"

  Else:

       store in "OtherProducts"

  NOTE: Only the five output files with the names mentioned above have to be generated.
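
The filtering rules above can be sketched as a pure Java function, independent of Hadoop. The class name, method signature, and the if/else-if precedence between overlapping rules are my assumptions; the thresholds and output names come from the problem statement:

```java
// Sketch of the classification rules as a pure function.
// Rule precedence (first match wins) is an assumption.
public class SensexClassifier {
    public static String classify(String typeOfTrading, double openBalance,
                                  double closingBalance, double fltuationRate,
                                  String sensexLoc) {
        if ("SIP".equalsIgnoreCase(typeOfTrading)) {
            if (openBalance > 25000 && fltuationRate > 10)
                return "HighDemandMarket";
            if (closingBalance < 22000 && fltuationRate >= 20 && fltuationRate <= 30)
                return "OnGoingMarketStretegy";
        } else if ("SHORTTERM".equalsIgnoreCase(typeOfTrading)) {
            if (openBalance < 5000)
                return "WealthyProducts";
            if ("NewYork".equalsIgnoreCase(sensexLoc) || "Mumbai".equalsIgnoreCase(sensexLoc))
                return "ReliableProducts";
        }
        return "OtherProducts"; // everything else falls through here
    }
}
```

A record such as (SIP, open 30000, closing 25000, rate 15) would land in "HighDemandMarket", while any trading type other than SIP or SHORTTERM falls into "OtherProducts".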

  3. Develop a Pig script to filter the MapReduce output in the below fashion:

    – Provide the unique data

    – Sort the unique data based on SensexID.

  4. Export the same Pig output from HDFS to MySQL using Sqoop
  5. Store the same Pig output in a Hive external table
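
The Pig step could be sketched as follows; the load path, field names, and column order are assumptions for illustration, not the original script:

```pig
-- Load the MapReduce output (assumed comma-delimited with these fields)
raw  = LOAD '/user/sensex/mr_output/part-r-*' USING PigStorage(',')
         AS (sensexid:int, sensexname:chararray, typeoftrading:chararray,
             sensexloc:chararray, open_balance:double, closing_balance:double,
             fltuation_rate:double);
uniq = DISTINCT raw;                -- provide the unique data
srt  = ORDER uniq BY sensexid;      -- sort on SensexID
STORE srt INTO '/user/sensex/pig_output' USING PigStorage(',');
```

The Sqoop export (step 4) could then be a command along these lines, with the connection string and table name as assumptions:

```shell
sqoop export --connect jdbc:mysql://localhost/sensexdb \
  --username root -P --table sensex_result \
  --export-dir /user/sensex/pig_output \
  --input-fields-terminated-by ','
```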

Attribute Information or Dataset Details:

Data: Input Format – .PDF (our input data is in PDF format)

As shown below, I created 3,000 records of this kind on my own.

Technology Used

  1. Apache Hadoop (HDFS)
  2. MapReduce
  3. Apache Pig
  4. Apache Hive
  5. MySQL
  6. Shell Script
  7. Apache Sqoop
  8. Linux
  9. Java

Flow Chart

MapReduce Process

MapReduce Code

PdfInputDriver.java
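
A minimal sketch of what the driver could look like, wiring the custom PDF input format to the mapper and reducer below. The job name and the MultipleOutputs registration for the five output files are my assumptions; it requires the Hadoop client libraries on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PdfInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "SensexPdfProcessing");
        job.setJarByClass(PdfInputDriver.class);
        job.setInputFormatClass(PdfInputFormat.class); // custom PDF input format
        job.setMapperClass(SensexTradeMapper.class);
        job.setReducerClass(SensexTradeReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // one named output per target file required by the problem statement
        for (String name : new String[]{"HighDemandMarket", "OnGoingMarketStretegy",
                "WealthyProducts", "ReliableProducts", "OtherProducts"}) {
            MultipleOutputs.addNamedOutput(job, name, TextOutputFormat.class,
                    Text.class, Text.class);
        }
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```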

PdfInputFormat.java
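
A sketch of the input format: it only has to hand out the custom record reader and mark PDFs as non-splittable, since a PDF cannot be cut at arbitrary byte offsets:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PdfInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new PdfRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each PDF must be read as a whole
    }
}
```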

PdfRecordReader.java
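
A sketch of the record reader using Apache PDFBox (2.x import paths assumed): it extracts the full text of the PDF once during initialize and then serves each line as one record:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfRecordReader extends RecordReader<LongWritable, Text> {
    private String[] lines;
    private int current = -1;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        Path path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (FSDataInputStream in = fs.open(path);
             PDDocument doc = PDDocument.load(in)) {
            // extract the whole PDF text; treat each line as one record
            lines = new PDFTextStripper().getText(doc).split("\n");
        }
    }

    @Override
    public boolean nextKeyValue() {
        if (++current >= lines.length) return false;
        key.set(current);
        value.set(lines[current].trim());
        return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() {
        return lines == null ? 0f : (float) current / lines.length;
    }
    @Override public void close() { }
}
```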

SensexTradeMapper.java
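
A sketch of the mapper: it parses each extracted line, applies the filtering rules from the problem statement, and emits the category name as the key. The comma-delimited record layout and field order are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SensexTradeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // assumed layout: SensexID,SensexName,TypeOfTrading,SensexLoc,
        //                 OPEN_BALANCE,CLOSING_BALANCE,FLTUATION_RATE
        String[] f = value.toString().split(",");
        if (f.length < 7) return; // skip header or malformed lines
        String type = f[2].trim().toUpperCase();
        String loc = f[3].trim();
        double open = Double.parseDouble(f[4].trim());
        double close = Double.parseDouble(f[5].trim());
        double rate = Double.parseDouble(f[6].trim());

        String category = "OtherProducts";
        if (type.equals("SIP")) {
            if (open > 25000 && rate > 10) category = "HighDemandMarket";
            else if (close < 22000 && rate >= 20 && rate <= 30)
                category = "OnGoingMarketStretegy";
        } else if (type.equals("SHORTTERM")) {
            if (open < 5000) category = "WealthyProducts";
            else if (loc.equalsIgnoreCase("NewYork") || loc.equalsIgnoreCase("Mumbai"))
                category = "ReliableProducts";
        }
        context.write(new Text(category), value);
    }
}
```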

SensexTradeReducer.java
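
A sketch of the reducer: since the mapper keys each record by its category, the reducer can route every record into the matching named output via MultipleOutputs, producing the five required files. This pairing with the driver's named-output registration is my assumption:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SensexTradeReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text category, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        // write every record into the output file named after its category
        for (Text record : records) {
            mos.write(category.toString(), category, record);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush all named outputs
    }
}
```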
By Bhavesh