Project idea – The idea behind this Analysis project is to analysis a person makes a doctor’s appointment, receives all the instructions, and no-show. Who to blame?
Problem Statement or Business Problem
Problem
A person makes a doctor’s appointment, receives all the instructions, and no-show. Who to blame?
In this tutorial we will try to analyze why would some patient not show up for his medical appointment and whether there are reasons for that using the data we have. We will try to find some correlation between the different attributes we have and whether the patient shows up or not. The dataset we are going to use contains 110527 medical appointments and its 14 associated variables ( PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handcap’, SMS_received, No-show )
Attribute Information or Dataset Details:
- PatientId: double (nullable = true)
- AppointmentID: integer (nullable = true)
- Gender: string (nullable = true)
- ScheduledDay: timestamp (nullable = true)
- AppointmentDay: timestamp (nullable = true)
- Age: integer (nullable = true)
- Neighbourhood: string (nullable = true)
- Scholarship: integer (nullable = true)
- Hipertension: integer (nullable = true)
- Diabetes: integer (nullable = true)
- Alcoholism: integer (nullable = true)
- Handcap: integer (nullable = true)
- SMS_received: integer (nullable = true)
- No-show: string (nullable = true)
Technology Used
- Apache Spark
- Spark SQL
- Scala
- DataFrame-based API
- Databricks Notebook
Introduction
Welcome to this project on Medical Appointment Data Analysis in Apache Spark Analytics using Databricks platform community edition server which allows you to execute your spark code, free of cost on their server just by registering through email id.
In this project, we explore Apache Spark on the Databricks platform.
I am a firm believer that the best way to learn is by doing. That’s why I haven’t included any purely theoretical lectures in this tutorial: you will learn everything on the way and be able to put it into practice straight away. Seeing the way each feature works will help you learn Apache Spark thoroughly by heart.
We’re going to look at how to set up a Spark Cluster and get started with that. And we’ll look at how we can then use that Spark Cluster to take data coming into that Spark Cluster, a process that data, and analyze the data in Databricks platform. That’s pretty much what we’re going to learn in this tutorial.
In this project, we will be performing Medical Appointment Data Analysis
We will learn:
- Preparing the Data for Processing.
- Basics flow of data in Apache Spark, loading data, and working with data, this course shows you how Apache Spark is perfect for a Data Analysis job.
- Learn the basics of Databricks notebook by enrolling in Free Community Edition Server
The goal is to provide you with practical tools that will be beneficial for you in the future. While doing that, you’ll develop a model with a real use opportunity.
I am really excited you are here, I hope you are going to follow all the way to the end of the Project. It is fairly straight forward fairly easy to follow through the article we will show you step by step each line of code & we will explain what it does and why we are doing it.
Free Account creation in Databricks
Creating a Spark Cluster
Creating a Spark Cluster
Basics about Databricks notebook
Basics about Databricks notebook
Loading Data into Databricks Environment
Download Data
Load Data in Dataframe
val medical_data = sqlContext.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/FileStore/tables/no_showappointments.csv")
display(medical_data)
Count of Data
medical_data.count()
res2: Long = 110527
Statistics of Data
display(medical_data.describe())
Print Schema of Dataframe
medical_data.printSchema()
root
|-- PatientId: double (nullable = true)
|-- AppointmentID: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- ScheduledDay: timestamp (nullable = true)
|-- AppointmentDay: timestamp (nullable = true)
|-- Age: integer (nullable = true)
|-- Neighbourhood: string (nullable = true)
|-- Scholarship: integer (nullable = true)
|-- Hipertension: integer (nullable = true)
|-- Diabetes: integer (nullable = true)
|-- Alcoholism: integer (nullable = true)
|-- Handcap: integer (nullable = true)
|-- SMS_received: integer (nullable = true)
|-- No-show: string (nullable = true)
Exploratory Data Analysis or EDA
Creating Temporary View
medical_data.createOrReplaceTempView("MedicalData")
Count of no-show
Patient who missed their appointment based on Gender
Are patients with hypertension more likely to miss their appointment?
Are patients with scholarships more likely to miss their appointment?
Are patients who don't receive SMS more likely to miss their appointment?
Does age affect whether a patient will show up or not?
Patients missing their appointments for every neighborhood?
Conclusions
After analyzing the dataset here are some findings:
- Percentage of patients who didn’t show up for their appointment is 20.19%.
- The percentage of females missing their appointment is nearly two times the number of males. So females are more likely to miss their appointment.
- It seems that patients with scholarships are actually more likely to miss their appointment.
- A strange finding here suggests that patients who received an SMS are more likely to miss their appointment !!
- There is no clear relation between the age and whether the patients show up or not but younger patients are more likely to miss their appointments.