Bigdata Hadoop Archives - Projects Based Learning

The roadmap for becoming a Data Engineer

Bigdata HadoopApache Cassandra, Apache Druid, Apache Flume, Apache Hadoop, Apache Hive, Apache Kafka, Apache Spark, Apache Spark Machine Learning, Bigdata, Hadoop

The roadmap for becoming a Data Engineer typically involves mastering various skills and technologies. Here's a step-by-step guide:Step 1: Learn the FundamentalsProgramming Languages: Start with proficiency in languages like Python, SQL, and possibly Scala or Java.Database Knowledge: Understand different database systems (SQL and NoSQL) and their use cases.Data Structures and Algorithms: Gain a solid understanding of fundamental data structures and algorithms.Mathematics and Statistics: Familiarize yourself with concepts like probability, statistics, and linear algebra.Step 2: Acquire Big Data TechnologiesApache Hadoop: Learn the Hadoop ecosystem tools like HDFS, MapReduce, Hive, and Pig for distributed data processing.Apache Spark: Master Spark for data processing,…

Installing Apache Druid on the Local Machine

Bigdata HadoopApache Druid, Apache Hadoop, Apache Hive, Apache Kafka, Bigdata, Hadoop

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. Most often, Druid powers use cases where real-time ingestion, fast query performance, and high uptime are important.Druid is commonly used as the database backend for GUIs of analytical applications, or for highly-concurrent APIs that need fast aggregations. Druid works best with event-oriented data.Common application areas for Druid include:Clickstream analytics including web and mobile analyticsNetwork telemetry analytics including network performance monitoringServer metrics storageSupply chain analytics including manufacturing metricsApplication performance metricsDigital marketing/advertising analyticsBusiness intelligence/OLAP Prerequisites You can follow these steps on a relatively modest…

Installing Single Node Kafka Cluster

Bigdata HadoopApache Hadoop, Apache Kafka, Bigdata, Hadoop, Kafka

In this tutorial, we will set up a single-node Kafka Cluster and run it using the command line.Step 1) Let’s start getting the Kafka binary, you can download the Kafka binary from the below linkhttps://kafka.apache.org/Step 2) Click on Download button Click on the binary download to get the download started Kafka is download in the Downloaded folder Moving the Kafka download to the Kafka Directory (ie /home/dataengineer/kafka) Step 3) Unzip Kafkatar -xvzf kafka_2.12-3.6.0.tgz Step 4) START THE KAFKA ENVIRONMENTNOTE: Your local environment must have Java 8+ installed.Apache Kafka can be started using ZooKeeperKafka with ZooKeeperRun the following commands in order…

Installing Apache Flume on Ubuntu

Bigdata HadoopApache Flume, Bigdata, Flume, Hadoop, Linux

System Requirements:Java Runtime Environment - Java 1.8 or laterMemory - Sufficient memory for configurations used by sources, channels or sinksDisk Space - Sufficient disk space for configurations used by channels or sinksDirectory Permissions - Read/Write permissions for directories used by agentThe first step is to create a folder Flume:Make flume directory in /home/dataengineer/mkdir flumecd flume We need to go to the website https://flume.apache.org/ and click on download. A new webpage will get open click on apache-flume-1.11.0-bin.tar.gz A new webpage will get open https://www.apache.org/dyn/closer.lua/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz and copy the link shown to you. Type the below commandwget https://dlcdn.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz You will be able to…

MySQL client and Server Installation

Bigdata HadoopLinux, MySQL

Step 1: Update/Upgrade Package Repositorysudo apt updatesudo apt upgradeStep 2: Install MySQLsudo apt install mysql-serverWhen asked if you want to continue with the installation, answer Y and hit ENTER.Note: If you only want to connect to a remote MySQL server instead of hosting a database on your machine, install only the MySQL Client by running:sudo apt install mysql-client Step 3: Check if MySQL Service Is Runningsudo systemctl status mysql Step 4: Log in to MySQL Serversudo mysql -u root

Installing Apache Sqoop on Ubuntu

Bigdata HadoopApache Hadoop, Apache Sqoop, Bigdata, Hadoop

Step 1) Create a Sqoop directory by using the command mkdir sqoop so that we can download Apache Sqoop.Step 2) Download the stable version of Apache Sqoop (ie Apache Sqoop 1.4.7 in the year 2022) Website URL https://archive.apache.org/dist/sqoop/1.4.7/wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gzStep 3) Unzip the downloaded file using the tar commandtar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gzStep 4) Edit the .bashrc file by using the commandnano .bashrcStep 5) Enter the following commands below in bashrc file and save itexport SQOOP_HOME="/home/dataengineer/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0"export PATH=$PATH:$SQOOP_HOME/bin Step 6) Execute the below command on the command prompt so bashrc gets activated.source ~/.bashrcStep 7) Check the installed sqoop version using the below commandsqoop version…

Installing Apache Spark 3 in Local Mode – Command Line (Single Node Cluster) on Windows 10

Bigdata HadoopApache Spark

In this tutorial, we will set up a single node Spark cluster and run it in local mode using the command line.Step 1) Let's start getting the spark binary you can download the spark binary from the below linkDownload Spark link: https://spark.apache.org/Windows Utils link: https://github.com/steveloughran/winutilsStep 2) Click on Download Step 3) A new Web page will get open i) Choose a Spark release as 3.0.3ii) Choose a package type as Pre-built for Apache Hadoop 2.7 Step 4) Click on Download Spark spark-3.0.3-bin-hadoop2.7.tgz Step 5) A new Web Page will get open Step 6) Click on the link to download Step 7)…

Install Apache Spark On Ubuntu

Bigdata HadoopApache Spark

With this tutorial, we will learn the complete process to install Apache Spark 3.2.0 on Ubuntu 20. Prerequisite: Spark runs on Java 8/11, Scala 2.12, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.0 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x). Steps for Installing Apache Spark Step 1 - Create a directory for example $mkdir /home/bigdata/apachespark Step 2 - Move to Apache Spark directory $cd /home/bigdata/apachespark Step 3 - Download…

Apache Hive Installation Steps on Ubuntu

Bigdata HadoopApache Hadoop, Apache Hive, Bigdata, Hadoop, Linux

With this tutorial, we will learn the complete process to install Apache Hive 3.1.2 on Ubuntu 20.The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.Steps for Installing Hadoop on UbuntuStep 1 - Create a directory for example $mkdir /home/bigdata/apachehive Step 2 - Move to hadoop directory $cd /home/bigdata/apachehive Step 3 - Download Apache Hive (Link will change with respect to country so please get the download link from…

Installing Apache Pig on Ubuntu

Bigdata HadoopApache Pig, Linux

Download a recent stable release from one of the Apache Download website https://pig.apache.org/releases.htmlClick on Download A new Page will get open (https://www.apache.org/dyn/closer.cgi/pig) Click on the link as marked in the below image A new page will get Open Click on Latest Folder so the new page will get open Download the file as shown in the image We have downloaded the file in directory /home/dataengineer/apachepig/ Unzip the file using the below commandtar -xvzf pig-0.17.0.tar.gzAdd /pig-n.n.n/bin to your path. Use export (bash,sh,ksh) or setenv (tcsh,csh). For example:$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATHExecuting Pig Help Command

Tags