Analytics on India Census Using Apache Spark: Part 1

In this article, I explore census data for India to understand changes in the country's demographics: population growth, religion distribution, gender distribution, sex ratio, and more. Even with a small dataset, I could gain a lot of valuable insight about the country. I used Spark SQL and the built-in charts provided by Databricks.

India is the second-most populous country in the world, with over 1.271 billion people, about 17.5% of the world's population (more than a sixth). It is projected to become the world's most populous country by 2025, surpassing China, and to reach a population of 1.6 billion by 2050. Its population growth rate is 1.2%.

Attribute Information or Dataset Details:

| col_name | data_type | comment |
|---|---|---|
| SerialNo | string | null |
| State | string | null |
| District | string | null |
| Persons | bigint | null |
| Males | bigint | null |
| Females | bigint | null |
| Growthin1991to2001 | float | null |
| Rural | bigint | null |
| Urban | bigint | null |
| ScheduledCastepopulation | bigint | null |
| PercentageSC_tototal | bigint | null |
| Numberofhouseholds | bigint | null |
| Householdsizeperhousehold | bigint | null |
| Sexratiofemales_per_1000_males_ | bigint | null |
| Sex_ratio_0_6_years_ | bigint | null |
| Scheduled_Tribe_population | bigint | null |
| Percentage_to_total_population_ST_ | float | null |
| Persons_literate | bigint | null |
| Males_Literate | bigint | null |
| Females_Literate | bigint | null |
| Persons_literacy_rate | float | null |
| Males_Literatacy_Rate | float | null |
| Females_Literacy_Rate | float | null |
| Total_Educated | bigint | null |
| Data_without_level | bigint | null |
| Below_Primary | bigint | null |
| Primary | bigint | null |
| Middle | bigint | null |
| Matric_Higher_Secondary_Diploma | bigint | null |
| Graduate_and_Above | bigint | null |
| X0__4_years | bigint | null |
| X5__14_years | bigint | null |
| X15__59_years | bigint | null |
| X60_years_and_above_Incl_A_N_S_ | bigint | null |
| Total_workers | bigint | null |
| Main_workers | bigint | null |
| Margi0l_workers | bigint | null |
| Non_workers | bigint | null |
| SC_1_0me | string | null |
| SC_1_Population | bigint | null |
| SC_2_0me | string | null |
| SC_2_Population | bigint | null |
| SC_3_0me | string | null |
| SC_3_Population | bigint | null |
| Religeon_1_0me | string | null |
| Religeon_1_Population | bigint | null |
| Religeon_2_0me | string | null |
| Religeon_2_Population | bigint | null |
| Religeon_3_0me | string | null |
| Religeon_3_Population | bigint | null |
| ST_1_0me | string | null |
| ST_1_Population | bigint | null |
| ST_2_0me | string | null |
| ST_2_Population | bigint | null |
| ST_3_0me | string | null |
| ST_3_Population | bigint | null |
| Imp_Town_1_0me | string | null |
| Imp_Town_1_Population | bigint | null |
| Imp_Town_2_0me | string | null |
| Imp_Town_2_Population | bigint | null |
| Imp_Town_3_0me | string | null |
| Imp_Town_3_Population | bigint | null |
| Total_Inhabited_Villages | bigint | null |
| Drinking_water_facilities | bigint | null |
| Safe_Drinking_water | bigint | null |
| Electricity_Power_Supply_ | bigint | null |
| Electricity_domestic_ | bigint | null |
| Electricity_Agriculture_ | bigint | null |
| Primary_school | bigint | null |
| Middle_schools | bigint | null |
| Secondary_Sr_Secondary_schools | bigint | null |
| College | bigint | null |
| Medical_facility | bigint | null |
| Primary_Health_Centre | bigint | null |
| Primary_Health_Sub_Centre | bigint | null |
| Post_telegraph_and_telephone_facility | bigint | null |
| Bus_services | bigint | null |
| Paved_approach_road |  | null |
| Mud_approach_road | bigint | null |
| Permanent_House | float | null |
| Semi_permanent_House | float | null |
| Temporary_House | float | null |
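As a quick sanity check on the schema, the sex ratio can be recomputed from the Males and Females columns and compared with Sexratiofemales_per_1000_males_. The sketch below is only an illustration under several assumptions: the table name `census` is hypothetical, the rows are made-up placeholders rather than census figures, and Python's built-in sqlite3 stands in for Spark SQL so the example runs without a cluster. It also assumes the stored column follows the usual definition (females per 1000 males).

```python
import sqlite3

# Query text written to be valid in Spark SQL as well as SQLite.
# `census` is an assumed table name; Males and Females come from the schema.
SEX_RATIO_QUERY = """
SELECT State, District,
       ROUND(Females * 1000.0 / Males) AS females_per_1000_males
FROM census
WHERE Males > 0
"""

def sex_ratio(rows):
    """Run the query over (State, District, Males, Females) tuples."""
    conn = sqlite3.connect(":memory:")  # local stand-in for a Spark table
    conn.execute(
        "CREATE TABLE census (State TEXT, District TEXT, Males INTEGER, Females INTEGER)"
    )
    conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(SEX_RATIO_QUERY).fetchall()
    conn.close()
    return result

# Placeholder rows for illustration only, not real census data.
ratio_sample = [
    ("State A", "District 1", 1000, 933),
    ("State B", "District 2", 2000, 2100),
]
print(sex_ratio(ratio_sample))
```

In a Databricks notebook, the query text alone would presumably go into a `%sql` cell against whatever name the table was given.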

Table Created in Databricks Environment

Technology Used

  1. Apache Spark
  2. Spark SQL 
  3. DataFrame-based API
  4. Databricks Notebook

Free Account creation in Databricks

Creating a Spark Cluster

Basics about Databricks notebook

Code for Spark SQL to get India's States with Number of Districts
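A query along the following lines would return each state with its number of districts. This is a minimal sketch rather than the exact code from the notebook: the table name `census` is an assumption, the rows are placeholders, and Python's built-in sqlite3 is used as a stand-in for Spark SQL so the example can run locally. Only the State and District columns from the schema are used.

```python
import sqlite3

# Query text written to be valid in Spark SQL as well as SQLite.
# `census` is an assumed table name; State and District come from the schema.
DISTRICTS_PER_STATE = """
SELECT State, COUNT(DISTINCT District) AS Districts
FROM census
GROUP BY State
ORDER BY Districts DESC, State
"""

def districts_per_state(rows):
    """Run the query over (State, District) pairs in an in-memory table."""
    conn = sqlite3.connect(":memory:")  # local stand-in for a Spark table
    conn.execute("CREATE TABLE census (State TEXT, District TEXT)")
    conn.executemany("INSERT INTO census VALUES (?, ?)", rows)
    result = conn.execute(DISTRICTS_PER_STATE).fetchall()
    conn.close()
    return result

# Placeholder rows for illustration only, not real census data.
district_sample = [
    ("State A", "District 1"),
    ("State A", "District 2"),
    ("State B", "District 3"),
]
print(districts_per_state(district_sample))
```

In Databricks, the query text could be run in a `%sql` cell or passed to `spark.sql(...)`, and the resulting two-column output is the kind of table the built-in chart options work from.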

Plot Option for Chart

By Bhavesh