Bigdata Hadoop

Running Apache Zeppelin on Docker Desktop (Windows OS)

Apache Zeppelin is an open-source, web-based notebook that enables interactive data analytics. It supports multiple languages such as Scala, Python, and SQL, making it an excellent choice for data engineers, analysts, and scientists working with big data frameworks like Apache Spark, Flink, and Hadoop.

Setting up Zeppelin on a Windows system can be tricky because of dependency and configuration issues. Fortunately, Docker Desktop makes the process simple, reproducible, and fast. In this blog, we’ll walk you through how to run Apache Zeppelin on Docker Desktop on Windows, step by step.

✅ Prerequisites

Before you begin, make sure the following are installed on…
How to Run Apache Druid on Docker Desktop (Windows OS) – A Step-by-Step Guide

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics on large datasets. Running Druid on Docker Desktop on Windows lets data engineers and analysts spin up a full Druid cluster with minimal configuration. In this blog, we'll walk through how to get Apache Druid running locally using Docker.

Prerequisites

Before starting, ensure your system meets the following requirements:
- Windows 10/11 with WSL 2 enabled
- Docker Desktop installed and running
- Minimum 8GB RAM (16GB recommended for better performance)
- Git Bash or PowerShell for command-line execution

Step 1: Clone the Apache Druid GitHub Repository

Apache Druid provides a quickstart Docker Compose setup in its GitHub…
Running Hive on Windows Using Docker Desktop: Everything You Need to Know

Apache Hive is a powerful data warehouse infrastructure built on top of Apache Hadoop, providing SQL-like querying capabilities for big data processing. Running Hive on Docker simplifies the setup process and ensures a consistent environment across different systems. This guide will walk you through setting up Apache Hive on Docker Desktop on a Windows operating system.

Prerequisites

Before you start, ensure you have the following on your Windows system:
- Docker Desktop (with WSL 2 backend enabled)
- At least 8GB of RAM for smooth performance

Step 1: Pull the Required Docker Images

Pull the 4.0.1 image from Hive DockerHub (latest as of April 2025):

    docker pull apache/hive:4.0.1

This image comes…
The roadmap for becoming a Data Engineer 

The roadmap for becoming a Data Engineer typically involves mastering various skills and technologies. Here's a step-by-step guide:

Step 1: Learn the Fundamentals
- Programming Languages: Start with proficiency in languages like Python, SQL, and possibly Scala or Java.
- Database Knowledge: Understand different database systems (SQL and NoSQL) and their use cases.
- Data Structures and Algorithms: Gain a solid understanding of fundamental data structures and algorithms.
- Mathematics and Statistics: Familiarize yourself with concepts like probability, statistics, and linear algebra.

Step 2: Acquire Big Data Technologies
- Apache Hadoop: Learn the Hadoop ecosystem tools like HDFS, MapReduce, Hive, and Pig for distributed data processing.
- Apache Spark: Master Spark for data processing,…
Installing Apache Druid on the Local Machine

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics ("OLAP" queries) on large data sets. Most often, Druid powers use cases where real-time ingestion, fast query performance, and high uptime are important.

Druid is commonly used as the database backend for GUIs of analytical applications, or for highly concurrent APIs that need fast aggregations. Druid works best with event-oriented data.

Common application areas for Druid include:
- Clickstream analytics, including web and mobile analytics
- Network telemetry analytics, including network performance monitoring
- Server metrics storage
- Supply chain analytics, including manufacturing metrics
- Application performance metrics
- Digital marketing/advertising analytics
- Business intelligence/OLAP

Prerequisites

You can follow these steps on a relatively modest…
Installing Single Node Kafka Cluster

In this tutorial, we will set up a single-node Kafka cluster and run it using the command line.

Step 1) Let’s start by getting the Kafka binary. You can download it from the link below:
https://kafka.apache.org/

Step 2) Click the Download button, then click the binary download to start the download. Kafka is saved to the Downloads folder. Move the Kafka download to the Kafka directory (i.e. /home/dataengineer/kafka).

Step 3) Unzip Kafka:

    tar -xvzf kafka_2.12-3.6.0.tgz

Step 4) Start the Kafka environment

Note: Your local environment must have Java 8+ installed.

Apache Kafka can be started using ZooKeeper.

Kafka with ZooKeeper

Run the following commands in order…
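The Java 8+ requirement in Step 4 can be checked before starting Kafka. The following is a minimal sketch, not part of the original tutorial: `java_major` is a hypothetical helper introduced here that extracts the major version from a Java version string, where the legacy form `1.8.0_292` means Java 8 and the newer form `17.0.2` means Java 17.

```shell
#!/bin/sh
# java_major is a hypothetical helper: it prints the major version for a
# Java version string such as "1.8.0_292" (Java 8) or "17.0.2" (Java 17).
java_major() {
    v=$1
    case "$v" in
        1.*) v=${v#1.} ;;   # legacy scheme: 1.8.x means Java 8
    esac
    echo "${v%%[!0-9]*}"    # keep only the leading digits
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr; the version is the quoted token
    # on the line that contains the word "version".
    raw=$(java -version 2>&1 | grep 'version' | head -n 1 | cut -d '"' -f 2)
    major=$(java_major "$raw")
    if [ "${major:-0}" -ge 8 ]; then
        echo "Java $major detected: OK for Kafka"
    else
        echo "Java version '$raw' detected: Kafka needs Java 8 or later"
    fi
else
    echo "java not found: install Java 8 or later first"
fi
```

If the script reports that Java is missing or too old, install a suitable JDK before running the Kafka start commands.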
Installing Apache Flume on Ubuntu

System Requirements:
- Java Runtime Environment: Java 1.8 or later
- Memory: Sufficient memory for configurations used by sources, channels, or sinks
- Disk Space: Sufficient disk space for configurations used by channels or sinks
- Directory Permissions: Read/write permissions for directories used by the agent

The first step is to create a Flume folder. Make a flume directory in /home/dataengineer/:

    mkdir flume
    cd flume

Go to the website https://flume.apache.org/ and click Download. On the page that opens, click apache-flume-1.11.0-bin.tar.gz. Another page opens (https://www.apache.org/dyn/closer.lua/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz); copy the link shown to you. Then type the command below:

    wget https://dlcdn.apache.org/flume/1.11.0/apache-flume-1.11.0-bin.tar.gz

You will be able to…
MySQL client and Server Installation

Step 1: Update/Upgrade Package Repository

    sudo apt update
    sudo apt upgrade

Step 2: Install MySQL

    sudo apt install mysql-server

When asked if you want to continue with the installation, answer Y and hit ENTER.

Note: If you only want to connect to a remote MySQL server instead of hosting a database on your machine, install only the MySQL client by running:

    sudo apt install mysql-client

Step 3: Check if MySQL Service Is Running

    sudo systemctl status mysql

Step 4: Log in to MySQL Server

    sudo mysql -u root
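The four steps above can also be collected into one script. This is a minimal sketch under stated assumptions: the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here, not part of the original post. By default the script only prints each command so you can review it; setting `DRY_RUN=0` executes the commands for real.

```shell
#!/bin/sh
# DRY_RUN is a hypothetical switch: 1 (default) prints commands, 0 runs them.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper that either echoes or executes its arguments.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run sudo apt update                 # Step 1: refresh the package index
run sudo apt upgrade                # Step 1: upgrade installed packages
run sudo apt install mysql-server   # Step 2: install the MySQL server
run sudo systemctl status mysql     # Step 3: check that the service is running
run sudo mysql -u root              # Step 4: log in as root
```

When executed with `DRY_RUN=0`, `apt` still asks for confirmation as described in Step 2, and the last command opens an interactive MySQL prompt.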
Installing Apache Sqoop on Ubuntu

Step 1) Create a Sqoop directory so that we can download Apache Sqoop into it:

    mkdir sqoop

Step 2) Download the stable version of Apache Sqoop (i.e. Apache Sqoop 1.4.7 as of 2022).
Website URL: https://archive.apache.org/dist/sqoop/1.4.7/

    wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Step 3) Unzip the downloaded file using the tar command:

    tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Step 4) Edit the .bashrc file:

    nano ~/.bashrc

Step 5) Add the following lines to the .bashrc file and save it:

    export SQOOP_HOME="/home/dataengineer/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0"
    export PATH=$PATH:$SQOOP_HOME/bin

Step 6) Run the command below so the .bashrc changes take effect:

    source ~/.bashrc

Step 7) Check the installed Sqoop version using the command below:

    sqoop version…
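Steps 5 and 6 can be sketched as a small shell snippet. It assumes the install path used in Step 5; `append_to_path` is a hypothetical helper introduced here so that re-sourcing `.bashrc` does not add the same directory to `PATH` more than once.

```shell
#!/bin/sh
# Assumed install location, taken from Step 5; adjust if yours differs.
export SQOOP_HOME="/home/dataengineer/sqoop/sqoop-1.4.7.bin__hadoop-2.6.0"

# append_to_path is a hypothetical helper: it appends a directory to PATH
# only when that directory is not already listed.
append_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;             # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

append_to_path "$SQOOP_HOME/bin"
export PATH

# Report whether the sqoop launcher is now reachable.
if command -v sqoop >/dev/null 2>&1; then
    echo "sqoop found at $(command -v sqoop)"
else
    echo "sqoop not found on PATH; check SQOOP_HOME"
fi
```

The plain `export PATH=$PATH:$SQOOP_HOME/bin` from Step 5 works as well; the guard only keeps `PATH` from growing each time the file is sourced.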
Installing Apache Spark 3 in Local Mode – Command Line (Single Node Cluster) on Windows 10

In this tutorial, we will set up a single-node Spark cluster and run it in local mode using the command line.

Step 1) Let's start by getting the Spark binary. You can download it from the links below:
Download Spark link: https://spark.apache.org/
Windows Utils link: https://github.com/steveloughran/winutils

Step 2) Click Download.

Step 3) A new web page opens.
i) Choose Spark release 3.0.3.
ii) Choose the package type "Pre-built for Apache Hadoop 2.7".

Step 4) Click Download Spark: spark-3.0.3-bin-hadoop2.7.tgz

Step 5) A new web page opens.

Step 6) Click the link to download.

Step 7)…