Machine Learning Project on Mushroom Classification: Edible or Poisonous (Part 1)

A mushroom, or toadstool, is the fleshy, spore-bearing fruiting body of a fungus, typically produced above ground on soil or on its food source.

Problem Statement or Business Problem

In this project, looking at the various properties of a mushroom, we will predict whether the mushroom is edible or poisonous.

Attribute Information or Dataset Details:

For clarity, each attribute is listed below with its coded values.

  • classes: edible=e, poisonous=p
  • cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
  • cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
  • cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
  • bruises: bruises=t, no=f
  • odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
  • gill-attachment: attached=a, descending=d, free=f, notched=n
  • gill-spacing: close=c, crowded=w, distant=d
  • gill-size: broad=b, narrow=n
  • gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
  • stalk-shape: enlarging=e, tapering=t
  • stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
  • stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
  • stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
  • stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  • stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
  • veil-type: partial=p, universal=u
  • veil-color: brown=n, orange=o, white=w, yellow=y
  • ring-number: none=n, one=o, two=t
  • ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
  • spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
  • population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
  • habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
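
Note that the same letter means different things in different attributes: p is poisonous for classes, pungent for odor, and pink for cap-color. Any decoding therefore needs one lookup table per attribute. A minimal sketch in plain Scala follows; the object name MushroomCodes and the decode helper are hypothetical names introduced here for illustration, and only three of the attributes are shown.

```scala
// Hypothetical helper: per-attribute lookup tables copied from the attribute list.
object MushroomCodes {
  val classes: Map[String, String] =
    Map("e" -> "edible", "p" -> "poisonous")

  val bruises: Map[String, String] =
    Map("t" -> "bruises", "f" -> "no")

  val odor: Map[String, String] = Map(
    "a" -> "almond", "l" -> "anise", "c" -> "creosote", "y" -> "fishy",
    "f" -> "foul", "m" -> "musty", "n" -> "none", "p" -> "pungent", "s" -> "spicy"
  )

  // Return the readable label, or the raw code unchanged when it is not in the table.
  def decode(table: Map[String, String], code: String): String =
    table.getOrElse(code, code)
}
```

With these tables, MushroomCodes.decode(MushroomCodes.odor, "p") gives pungent, while the same code looked up in MushroomCodes.classes gives poisonous.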

The following image shows the mushroom parts mentioned above (image credit: Infovisual).

  • Cap: The cap is the top of the mushroom (and often looks sort of like a small umbrella). Mushroom caps can come in a variety of colors but most often are brown, white, or yellow.
  • Gills, Pores, or Teeth: These structures appear under the mushroom’s cap. They look similar to a fish’s gills.
  • Ring: The ring (sometimes called the annulus) is the remaining structure of the partial veil after the gills have pushed through.
  • Stem or Stipe: The stem is the tall structure that holds the cap high above the ground.
  • Volva: The volva is the protective veil that remains after the mushroom has sprouted from the ground. As the fungus grows, it breaks through the volva.
  • Spores: Microscopic, seed-like reproductive cells; they are usually released into the air and fall on a substrate to produce a new mushroom.

Technology Used

  1. Apache Spark
  2. Spark SQL
  3. Apache Spark MLlib
  4. Scala
  5. DataFrame-based API
  6. Databricks Notebook

Introduction

Welcome to this project on predicting whether a mushroom is edible or poisonous with Apache Spark machine learning. We use the Databricks Community Edition, a hosted platform that lets you run your Spark code on its servers free of charge after registering with an email address.

In this project, we explore Apache Spark and Machine Learning on the Databricks platform.

I am a firm believer that the best way to learn is by doing. That’s why I haven’t included any purely theoretical lectures in this tutorial: you will learn everything along the way and be able to put it into practice straight away. Seeing how each feature works will help you learn Apache Spark machine learning thoroughly.

We’re going to look at how to set up a Spark cluster and get started with it. We’ll then use that cluster to take incoming data, process it with a machine learning model, and generate output in the form of a prediction. That, in short, is what a predictive model does.

In this project, we will predict whether mushrooms are edible or poisonous.

We will learn:

  • Preparing the data for processing
  • The basic flow of data in Apache Spark: loading data and working with it, which shows why Apache Spark suits machine learning jobs
  • The basics of Databricks notebooks, using the free Community Edition server
  • Defining the machine learning pipeline
  • Training a machine learning model
  • Testing a machine learning model
  • Evaluating a machine learning model (i.e., examining the predicted and actual values)

The goal is to provide you with practical tools that will be beneficial for you in the future. While doing that, you’ll develop a model with a real-world use case.

I am really excited you are here, and I hope you will follow along to the end of the project. It is fairly straightforward and easy to follow: the article walks through the code step by step, explaining what each line does and why we are doing it.

Free Account creation in Databricks

Creating a Spark Cluster

Basics about Databricks notebook

Loading Data into Databricks Environment

Download Data

Load Data into a DataFrame (Schema Inferred from the CSV)

%scala

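// Read the uploaded CSV; the first row holds the column names.
// spark is the SparkSession that a Databricks notebook provides.
val mushroom = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true") // every value is a one-letter code, so all columns come out as string
  .option("delimiter", ",")
  .load("/FileStore/tables/mushrooms-1.csv")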

mushroom.show()

Output:

+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|class|capshape|capsurface|capcolor|bruises|odor|gillattachment|gillspacing|gillsize|gillcolor|stalkshape|stalkroot|stalksurfaceabovering|stalksurfacebelowring|stalkcolorabovering|stalkcolorbelowring|veiltype|veilcolor|ringnumber|ringtype|sporeprintcolor|population|habitat|
+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      u|
|    e|       x|         s|       y|      t|   a|             f|          c|       b|        k|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      g|
|    e|       b|         s|       w|      t|   l|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      m|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      u|
|    e|       x|         s|       g|      f|   n|             f|          w|       b|        k|         t|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       e|              n|         a|      g|
|    e|       x|         y|       y|      t|   a|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         n|      g|
|    e|       b|         s|       w|      t|   a|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         n|      m|
|    e|       b|         y|       w|      t|   l|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      m|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        p|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         v|      g|
|    e|       b|         s|       y|      t|   a|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      m|
|    e|       x|         y|       y|      t|   l|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      g|
|    e|       x|         y|       y|      t|   a|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      m|
|    e|       b|         s|       y|      t|   a|             f|          c|       b|        w|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      g|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         v|      u|
|    e|       x|         f|       n|      f|   n|             f|          w|       b|        n|         t|        e|                    s|                    f|                  w|                  w|       p|        w|         o|       e|              k|         a|      g|
|    e|       s|         f|       g|      f|   n|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         y|      u|
|    e|       f|         f|       w|      f|   n|             f|          w|       b|        k|         t|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       e|              n|         a|      g|
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      g|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      u|
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      u|
+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
only showing top 20 rows
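
The load step above lets Spark infer the schema, which costs an extra pass over the file. Since the output shows only one-letter codes, an explicit schema with every column declared as a string is a possible alternative. The sketch below is an assumption, not the notebook's original code: mushroomColumns, mushroomSchema, and loadMushrooms are hypothetical names, and the column names are copied from the header row shown in the output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Column names as they appear in the DataFrame output above.
val mushroomColumns: Seq[String] = Seq(
  "class", "capshape", "capsurface", "capcolor", "bruises", "odor",
  "gillattachment", "gillspacing", "gillsize", "gillcolor",
  "stalkshape", "stalkroot", "stalksurfaceabovering", "stalksurfacebelowring",
  "stalkcolorabovering", "stalkcolorbelowring",
  "veiltype", "veilcolor", "ringnumber", "ringtype",
  "sporeprintcolor", "population", "habitat"
)

// Every attribute is a one-letter category code, so each field is a nullable string.
val mushroomSchema: StructType =
  StructType(mushroomColumns.map(name => StructField(name, StringType, nullable = true)))

// Hypothetical helper: read the CSV with the explicit schema instead of inferSchema.
def loadMushrooms(spark: SparkSession, path: String): DataFrame =
  spark.read.format("csv")
    .option("header", "true")
    .option("delimiter", ",")
    .schema(mushroomSchema)
    .load(path)
```

In the notebook, loadMushrooms(spark, "/FileStore/tables/mushrooms-1.csv") should return a DataFrame with the same 23 string columns as mushroom.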

Print Schema of Dataframe

%scala

mushroom.printSchema()

Output:

root
 |-- class: string (nullable = true)
 |-- capshape: string (nullable = true)
 |-- capsurface: string (nullable = true)
 |-- capcolor: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gillattachment: string (nullable = true)
 |-- gillspacing: string (nullable = true)
 |-- gillsize: string (nullable = true)
 |-- gillcolor: string (nullable = true)
 |-- stalkshape: string (nullable = true)
 |-- stalkroot: string (nullable = true)
 |-- stalksurfaceabovering: string (nullable = true)
 |-- stalksurfacebelowring: string (nullable = true)
 |-- stalkcolorabovering: string (nullable = true)
 |-- stalkcolorbelowring: string (nullable = true)
 |-- veiltype: string (nullable = true)
 |-- veilcolor: string (nullable = true)
 |-- ringnumber: string (nullable = true)
 |-- ringtype: string (nullable = true)
 |-- sporeprintcolor: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string (nullable = true)

Statistics of Data

%scala

mushroom.describe().show()

Output:

+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|summary|class|capshape|capsurface|capcolor|bruises|odor|gillattachment|gillspacing|gillsize|gillcolor|stalkshape|stalkroot|stalksurfaceabovering|stalksurfacebelowring|stalkcolorabovering|stalkcolorbelowring|veiltype|veilcolor|ringnumber|ringtype|sporeprintcolor|population|habitat|
+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|  count| 8124|    8124|      8124|    8124|   8124|8124|          8124|       8124|    8124|     8124|      8124|     8124|                 8124|                 8124|               8124|               8124|    8124|     8124|      8124|    8124|           8124|      8124|   8124|
|   mean| null|    null|      null|    null|   null|null|          null|       null|    null|     null|      null|     null|                 null|                 null|               null|               null|    null|     null|      null|    null|           null|      null|   null|
| stddev| null|    null|      null|    null|   null|null|          null|       null|    null|     null|      null|     null|                 null|                 null|               null|               null|    null|     null|      null|    null|           null|      null|   null|
|    min|    e|       b|         f|       b|      f|   a|             a|          c|       b|        b|         e|        ?|                    f|                    f|                  b|                  b|       p|        n|         n|       e|              b|         a|      d|
|    max|    p|       x|         y|       y|      t|   y|             f|          w|       n|        y|         t|        r|                    y|                    y|                  y|                  y|       p|        y|         t|       p|              y|         y|      w|
+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
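
Because every column is a string, describe() can only report count, min, and max; mean and stddev come back null. Two details in the table are still worth noting: stalkroot has ? as its minimum, which is the missing-value marker from the attribute list, and veiltype has p as both min and max, so it holds a single value in this data. A sketch of two follow-up checks, with distinctCounts and countMarker as hypothetical helper names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, countDistinct}

// Number of distinct codes in each column; a result of 1 flags a constant column.
def distinctCounts(df: DataFrame): DataFrame =
  df.select(df.columns.map(name => countDistinct(col(name)).alias(name)): _*)

// Number of rows where the given column holds the given marker (for example "?").
def countMarker(df: DataFrame, column: String, marker: String): Long =
  df.filter(col(column) === marker).count()
```

In the notebook these could be called as distinctCounts(mushroom).show() and countMarker(mushroom, "stalkroot", "?").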

Create Temporary View so we can perform Spark SQL on Data

%scala

mushroom.createOrReplaceTempView("MushroomData")

Spark SQL

%sql

select * from MushroomData;

Exploratory Data Analysis (EDA)

Bruises Counts with Mushroom Types

%sql

select count(class), 
CASE 
	WHEN class = "e" THEN "Edible"
	ELSE "Poisonous"
END AS CLASSES,
bruises from MushroomData group by CLASSES, bruises;  

Mushroom Cap Color Quantity

%sql

select count(capcolor), 
CASE 
	WHEN capcolor = "n" THEN "Brown"
	WHEN capcolor = "b" THEN "Buff"
	WHEN capcolor = "c" THEN "Cinnamon"
	WHEN capcolor = "g" THEN "Gray"
	WHEN capcolor = "r" THEN "Green"
	WHEN capcolor = "p" THEN "Pink"
	WHEN capcolor = "u" THEN "Purple"
	WHEN capcolor = "e" THEN "Red"
	WHEN capcolor = "w" THEN "White"
	ELSE "Yellow"
END AS ColorOfCap 
from MushroomData group by capcolor order by count(capcolor) desc;
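
The CASE expression above is repeated in several of the queries that follow. When working from a Scala cell instead of a SQL cell, the same mapping can be built once from a lookup table. This is a sketch of one way to do that with the DataFrame API, not the approach used in this notebook; labelColumn and capColorLabels are hypothetical names.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

// Cap-color codes from the attribute list.
val capColorLabels: Map[String, String] = Map(
  "n" -> "Brown", "b" -> "Buff", "c" -> "Cinnamon", "g" -> "Gray", "r" -> "Green",
  "p" -> "Pink", "u" -> "Purple", "e" -> "Red", "w" -> "White", "y" -> "Yellow"
)

// Build a nested CASE WHEN from the map; codes missing from the map pass through unchanged.
def labelColumn(column: String, labels: Map[String, String]): Column = {
  val source = col(column)
  labels.foldLeft(source) { case (fallback, (code, label)) =>
    when(source === code, label).otherwise(fallback)
  }
}
```

One difference from the SQL version: the query's ELSE branch labels every unmatched code as Yellow, whereas this sketch leaves an unknown code as it is, which makes unexpected values easier to spot. A possible call is mushroom.groupBy(labelColumn("capcolor", capColorLabels).alias("ColorOfCap")).count().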

Edible and Poisonous Mushrooms Based on Cap Color

%sql

select count(capcolor),
CASE 
	WHEN class = "e" THEN "Edible"
	ELSE "Poisonous"
END AS CLASSES,
CASE 
	WHEN capcolor = "n" THEN "Brown"
	WHEN capcolor = "b" THEN "Buff"
	WHEN capcolor = "c" THEN "Cinnamon"
	WHEN capcolor = "g" THEN "Gray"
	WHEN capcolor = "r" THEN "Green"
	WHEN capcolor = "p" THEN "Pink"
	WHEN capcolor = "u" THEN "Purple"
	WHEN capcolor = "e" THEN "Red"
	WHEN capcolor = "w" THEN "White"
	ELSE "Yellow"
END AS ColorOfCap 
from MushroomData group by capcolor,class order by count(capcolor) desc;

Mushroom Odor and Quantity

%sql

select count(odor), 
CASE 
	WHEN odor = "a" THEN "almond"
	WHEN odor = "l" THEN "anise"
	WHEN odor = "c" THEN "creosote"
	WHEN odor = "y" THEN "fishy"
	WHEN odor = "f" THEN "foul"
	WHEN odor = "m" THEN "musty"
	WHEN odor = "n" THEN "none"
	WHEN odor = "p" THEN "pungent"
	ELSE "spicy"
END AS odor 
from MushroomData group by odor order by count(odor) desc;  

Edible and Poisonous Mushrooms Based on Odor

%sql

select count(odor), 
CASE 
	WHEN class = "e" THEN "Edible"
	ELSE "Poisonous"
END AS CLASSES,
CASE 
	WHEN odor = "a" THEN "almond"
	WHEN odor = "l" THEN "anise"
	WHEN odor = "c" THEN "creosote"
	WHEN odor = "y" THEN "fishy"
	WHEN odor = "f" THEN "foul"
	WHEN odor = "m" THEN "musty"
	WHEN odor = "n" THEN "none"
	WHEN odor = "p" THEN "pungent"
	ELSE "spicy"
END AS odor 
from MushroomData group by odor, class order by count(odor) desc;  
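
The grouped result above has one row per odor and class pair. A contingency table, with one row per odor and one count column per class, can make the same numbers easier to compare side by side. A minimal sketch using the DataFrame stat.crosstab function; odorByClass is a hypothetical helper name.

```scala
import org.apache.spark.sql.DataFrame

// One row per odor code, one count column per class code (e and p).
// Spark names the first column odor_class.
def odorByClass(df: DataFrame): DataFrame =
  df.stat.crosstab("odor", "class")
```

In the notebook, odorByClass(mushroom).show() would print the table; the odor values stay as one-letter codes.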

Mushroom Population Type Percentage

%sql

select count(population), 
CASE 
	WHEN population = "a" THEN "abundant"
	WHEN population = "c" THEN "clustered"
	WHEN population = "n" THEN "numerous"
	WHEN population = "s" THEN "scattered"
	WHEN population = "v" THEN "several"
	ELSE "solitary"
END AS Population 
from MushroomData group by Population; 
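
The query above returns a row count per population type rather than a percentage. If the share itself is needed as a number, each count can be divided by the total number of rows. A sketch under that assumption, with percentageBy as a hypothetical helper name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, lit, round}

// Count rows per distinct value of `column` and add each group's share of all rows.
def percentageBy(df: DataFrame, column: String): DataFrame = {
  val total = df.count() // total number of rows, used as the denominator
  df.groupBy(col(column))
    .agg(count(lit(1)).alias("rows"))
    .withColumn("percent", round(col("rows") * 100.0 / total, 2))
    .orderBy(col("rows").desc)
}
```

For example, percentageBy(mushroom, "population").show() would list each population code with its row count and percentage. The same helper applies to habitat or any other column.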

Edible & Poisonous Mushroom Population Type Percentage

%sql

select count(population), 
CASE 
	WHEN class = "e" THEN "Edible"
	ELSE "Poisonous"
END AS CLASSES,
CASE 
	WHEN population = "a" THEN "abundant"
	WHEN population = "c" THEN "clustered"
	WHEN population = "n" THEN "numerous"
	WHEN population = "s" THEN "scattered"
	WHEN population = "v" THEN "several"
	ELSE "solitary"
END AS Population 
from MushroomData group by Population, class;  

Mushroom Habitat Type Percentage

%sql

select count(habitat), 
CASE 
	WHEN habitat = "g" THEN "grasses"
	WHEN habitat = "l" THEN "leaves"
	WHEN habitat = "m" THEN "meadows"
	WHEN habitat = "p" THEN "paths"
	WHEN habitat = "u" THEN "urban"
	WHEN habitat = "w" THEN "waste"
	ELSE "woods"
END AS Habitat
from MushroomData group by habitat;

Edible & Poisonous Mushroom Habitat Type Percentage

%sql

select count(habitat), 
CASE 
	WHEN class = "e" THEN "Edible"
	ELSE "Poisonous"
END AS CLASSES,
CASE 
	WHEN habitat = "g" THEN "grasses"
	WHEN habitat = "l" THEN "leaves"
	WHEN habitat = "m" THEN "meadows"
	WHEN habitat = "p" THEN "paths"
	WHEN habitat = "u" THEN "urban"
	WHEN habitat = "w" THEN "waste"
	ELSE "woods"
END AS Habitat
from MushroomData group by habitat, class;

By Bhavesh