A mushroom, or toadstool, is the fleshy, spore-bearing fruiting body of a fungus, typically produced above ground on soil or on its food source.
Problem Statement or Business Problem
In this project, we predict whether a mushroom is edible or poisonous from its physical properties.
Attribute Information or Dataset Details:
For clarity, each attribute is listed below with its possible values and their single-letter codes.
- classes: edible=e, poisonous=p
- cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
- cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
- cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
- bruises: bruises=t, no=f
- odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
- gill-attachment: attached=a, descending=d, free=f, notched=n
- gill-spacing: close=c, crowded=w, distant=d
- gill-size: broad=b, narrow=n
- gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
- stalk-shape: enlarging=e, tapering=t
- stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
- stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
- stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
- veil-type: partial=p, universal=u
- veil-color: brown=n, orange=o, white=w, yellow=y
- ring-number: none=n, one=o, two=t
- ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
- spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
- population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
- habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
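Because every value in the dataset is a single-letter code, a small lookup table can make results easier to read. The following is a minimal sketch in plain Scala, assuming only the code lists above; the names `classLabels`, `odorLabels`, `habitatLabels`, and `decode` are hypothetical and not part of the original notebook. The remaining attributes would follow the same pattern.

```scala
// Lookup tables copied from the attribute list above (code -> readable label).
val classLabels: Map[String, String] = Map("e" -> "edible", "p" -> "poisonous")

val odorLabels: Map[String, String] = Map(
  "a" -> "almond", "l" -> "anise", "c" -> "creosote", "y" -> "fishy", "f" -> "foul",
  "m" -> "musty", "n" -> "none", "p" -> "pungent", "s" -> "spicy"
)

val habitatLabels: Map[String, String] = Map(
  "g" -> "grasses", "l" -> "leaves", "m" -> "meadows", "p" -> "paths",
  "u" -> "urban", "w" -> "waste", "d" -> "woods"
)

// Hypothetical helper: returns the readable label, or the raw code when it is unknown.
def decode(labels: Map[String, String], code: String): String =
  labels.getOrElse(code, code)

println(decode(odorLabels, "l"))    // prints "anise"
println(decode(habitatLabels, "d")) // prints "woods"
```

Falling back to the raw code keeps unexpected values visible instead of silently mapping them to a default label.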
The following image shows the mushroom parts mentioned above (image credit: Infovisual).
- Cap: The cap is the top of the mushroom (and often looks sort of like a small umbrella). Mushroom caps can come in a variety of colors but most often are brown, white, or yellow.
- Gills, Pores, or Teeth: These structures appear under the mushroom’s cap. They look similar to a fish’s gills.
- Ring: The ring (sometimes called the annulus) is the remaining structure of the partial veil after the gills have pushed through.
- Stem or Stipe: The stem is the tall structure that holds the cap high above the ground.
- Volva: The volva is the protective veil that remains after the mushroom has sprouted from the ground. As the fungus grows, it breaks through the volva.
- Spores: Microscopic, seed-like reproductive cells; they are usually released into the air and fall on a substrate to produce a new mushroom.
Technology Used
- Apache Spark
- Spark SQL
- Apache Spark MLLib
- Scala
- DataFrame-based API
- Databricks Notebook
Introduction
Welcome to this project on predicting whether a mushroom is edible or poisonous with Apache Spark machine learning on the Databricks Community Edition, a free platform that lets you run Spark code on Databricks servers after registering with an email address.
In this project, we explore Apache Spark and Machine Learning on the Databricks platform.
I am a firm believer that the best way to learn is by doing. That’s why I haven’t included any purely theoretical lectures in this tutorial: you will learn everything on the way and be able to put it into practice straight away. Seeing the way each feature works will help you learn Apache Spark machine learning thoroughly by heart.
We’re going to look at how to set up a Spark cluster and get started with it. We’ll then look at how to use that cluster to take incoming data, process it with a machine learning model, and generate output in the form of a prediction. That is the essence of a predictive model.
In this project, we will predict whether mushrooms are edible or poisonous.
We will learn:
- Preparing the data for processing
- The basic flow of data in Apache Spark: loading data and working with it, and why Apache Spark suits machine learning jobs
- The basics of Databricks notebooks, using the free Community Edition
- Defining the machine learning pipeline
- Training a machine learning model
- Testing a machine learning model
- Evaluating a machine learning model (i.e., examining the predicted and actual values)
The goal is to provide you with practical tools that will be beneficial for you in the future. While doing that, you’ll develop a model with a real use opportunity.
I am really excited you are here, and I hope you will follow along to the end of the project. It is fairly straightforward and easy to follow: throughout the article we show each line of code step by step and explain what it does and why we are doing it.
Free Account creation in Databricks
Creating a Spark Cluster
Basics about Databricks notebook
Loading Data into Databricks Environment
Download Data
Load Data in Dataframe using User-defined Schema
%scala
val mushroom = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .load("/FileStore/tables/mushrooms-1.csv")

mushroom.show()

Output:
+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|class|capshape|capsurface|capcolor|bruises|odor|gillattachment|gillspacing|gillsize|gillcolor|stalkshape|stalkroot|stalksurfaceabovering|stalksurfacebelowring|stalkcolorabovering|stalkcolorbelowring|veiltype|veilcolor|ringnumber|ringtype|sporeprintcolor|population|habitat|
+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      u|
|    e|       x|         s|       y|      t|   a|             f|          c|       b|        k|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      g|
|    e|       b|         s|       w|      t|   l|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      m|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      u|
|    e|       x|         s|       g|      f|   n|             f|          w|       b|        k|         t|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       e|              n|         a|      g|
|    e|       x|         y|       y|      t|   a|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         n|      g|
|    e|       b|         s|       w|      t|   a|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         n|      m|
|    e|       b|         y|       w|      t|   l|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      m|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        p|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         v|      g|
|    e|       b|         s|       y|      t|   a|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      m|
|    e|       x|         y|       y|      t|   l|             f|          c|       b|        g|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         n|      g|
|    e|       x|         y|       y|      t|   a|             f|          c|       b|        n|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      m|
|    e|       b|         s|       y|      t|   a|             f|          c|       b|        w|         e|        c|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      g|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         v|      u|
|    e|       x|         f|       n|      f|   n|             f|          w|       b|        n|         t|        e|                    s|                    f|                  w|                  w|       p|        w|         o|       e|              k|         a|      g|
|    e|       s|         f|       g|      f|   n|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         y|      u|
|    e|       f|         f|       w|      f|   n|             f|          w|       b|        k|         t|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       e|              n|         a|      g|
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              k|         s|      g|
|    p|       x|         y|       w|      t|   p|             f|          c|       n|        n|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      u|
|    p|       x|         s|       n|      t|   p|             f|          c|       n|        k|         e|        e|                    s|                    s|                  w|                  w|       p|        w|         o|       p|              n|         s|      u|
+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
only showing top 20 rows
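The load above lets Spark infer the schema from the file. Where a user-defined schema is preferred, the column types can be declared up front so that Spark does not need an extra pass over the data. The following is a minimal sketch, assuming the 23 column names shown in the header row above and that every column is a nullable string; `mushroomColumns`, `mushroomSchema`, and `loadMushrooms` are hypothetical names, not part of the original notebook.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Column names exactly as they appear in the header row of the CSV.
val mushroomColumns: Seq[String] = Seq(
  "class", "capshape", "capsurface", "capcolor", "bruises", "odor",
  "gillattachment", "gillspacing", "gillsize", "gillcolor",
  "stalkshape", "stalkroot", "stalksurfaceabovering", "stalksurfacebelowring",
  "stalkcolorabovering", "stalkcolorbelowring",
  "veiltype", "veilcolor", "ringnumber", "ringtype",
  "sporeprintcolor", "population", "habitat"
)

// Every attribute is a single-letter code, so each field is declared as a nullable string.
val mushroomSchema: StructType =
  StructType(mushroomColumns.map(name => StructField(name, StringType, nullable = true)))

// Hypothetical helper: reads the CSV with the explicit schema instead of inferSchema.
def loadMushrooms(spark: SparkSession, path: String): DataFrame =
  spark.read
    .format("csv")
    .option("header", "true") // the first row is a header, not data
    .schema(mushroomSchema)
    .load(path)
```

In a Databricks notebook, where `spark` is predefined, this could be called as `loadMushrooms(spark, "/FileStore/tables/mushrooms-1.csv")`.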
Print Schema of Dataframe
%scala
mushroom.printSchema()

Output:
root
 |-- class: string (nullable = true)
 |-- capshape: string (nullable = true)
 |-- capsurface: string (nullable = true)
 |-- capcolor: string (nullable = true)
 |-- bruises: string (nullable = true)
 |-- odor: string (nullable = true)
 |-- gillattachment: string (nullable = true)
 |-- gillspacing: string (nullable = true)
 |-- gillsize: string (nullable = true)
 |-- gillcolor: string (nullable = true)
 |-- stalkshape: string (nullable = true)
 |-- stalkroot: string (nullable = true)
 |-- stalksurfaceabovering: string (nullable = true)
 |-- stalksurfacebelowring: string (nullable = true)
 |-- stalkcolorabovering: string (nullable = true)
 |-- stalkcolorbelowring: string (nullable = true)
 |-- veiltype: string (nullable = true)
 |-- veilcolor: string (nullable = true)
 |-- ringnumber: string (nullable = true)
 |-- ringtype: string (nullable = true)
 |-- sporeprintcolor: string (nullable = true)
 |-- population: string (nullable = true)
 |-- habitat: string (nullable = true)
Statistics of Data
%scala
mushroom.describe().show()

Output:
+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|summary|class|capshape|capsurface|capcolor|bruises|odor|gillattachment|gillspacing|gillsize|gillcolor|stalkshape|stalkroot|stalksurfaceabovering|stalksurfacebelowring|stalkcolorabovering|stalkcolorbelowring|veiltype|veilcolor|ringnumber|ringtype|sporeprintcolor|population|habitat|
+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
|  count| 8124|    8124|      8124|    8124|   8124|8124|          8124|       8124|    8124|     8124|      8124|     8124|                 8124|                 8124|               8124|               8124|    8124|     8124|      8124|    8124|           8124|      8124|   8124|
|   mean| null|    null|      null|    null|   null|null|          null|       null|    null|     null|      null|     null|                 null|                 null|               null|               null|    null|     null|      null|    null|           null|      null|   null|
| stddev| null|    null|      null|    null|   null|null|          null|       null|    null|     null|      null|     null|                 null|                 null|               null|               null|    null|     null|      null|    null|           null|      null|   null|
|    min|    e|       b|         f|       b|      f|   a|             a|          c|       b|        b|         e|        ?|                    f|                    f|                  b|                  b|       p|        n|         n|       e|              b|         a|      d|
|    max|    p|       x|         y|       y|      t|   y|             f|          w|       n|        y|         t|        r|                    y|                    y|                  y|                  y|       p|        y|         t|       p|              y|         y|      w|
+-------+-----+--------+----------+--------+-------+----+--------------+-----------+--------+---------+----------+---------+---------------------+---------------------+-------------------+-------------------+--------+---------+----------+--------+---------------+----------+-------+
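Since every column is a string, describe() can only report the count and the alphabetical min and max; mean and stddev are null. The output still gives two hints: veiltype has the same min and max (p), so it appears to hold a single value, and stalkroot has a min of ?, the code for missing. Counting the values per column makes this explicit. The following is a minimal sketch; `valueCounts` and `distinctCounts` are hypothetical helpers, not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}

// Hypothetical helper: how often each code occurs in one column, most frequent first.
def valueCounts(df: DataFrame, column: String): DataFrame =
  df.groupBy(col(column)).count().orderBy(desc("count"))

// Hypothetical helper: number of distinct codes in every column.
// This runs one Spark job per column, which is acceptable for a dataset of this size.
def distinctCounts(df: DataFrame): Map[String, Long] =
  df.columns.map(name => name -> df.select(col(name)).distinct().count()).toMap
```

With the DataFrame loaded earlier, `valueCounts(mushroom, "stalkroot").show()` would list how many rows carry the ? code, and `distinctCounts(mushroom)` would show which columns hold only one value and therefore cannot help a model.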
Create Temporary View so we can perform Spark SQL on Data
%scala
mushroom.createOrReplaceTempView("MushroomData")
Spark SQL
%sql
select * from MushroomData;
Exploratory Data Analysis or EDA
Bruises Counts with Mushroom Types
%sql
select count(class),
       CASE WHEN class = "e" THEN "Edible" ELSE "Poisonous" END AS CLASSES,
       bruises
from MushroomData
group by CLASSES, bruises;
Mushroom Cap Color Quantity
%sql
select count(capcolor),
       CASE
         WHEN capcolor = "n" THEN "Brown"
         WHEN capcolor = "b" THEN "Buff"
         WHEN capcolor = "c" THEN "Cinnamon"
         WHEN capcolor = "g" THEN "Gray"
         WHEN capcolor = "r" THEN "Green"
         WHEN capcolor = "p" THEN "Pink"
         WHEN capcolor = "u" THEN "Purple"
         WHEN capcolor = "e" THEN "Red"
         WHEN capcolor = "w" THEN "White"
         ELSE "Yellow"
       END AS ColorOfCap
from MushroomData
group by capcolor
order by count(capcolor) desc;
Edible and Poisonous Mushrooms Based on Cap Color
%sql
select count(capcolor),
       CASE WHEN class = "e" THEN "Edible" ELSE "Poisonous" END AS CLASSES,
       CASE
         WHEN capcolor = "n" THEN "Brown"
         WHEN capcolor = "b" THEN "Buff"
         WHEN capcolor = "c" THEN "Cinnamon"
         WHEN capcolor = "g" THEN "Gray"
         WHEN capcolor = "r" THEN "Green"
         WHEN capcolor = "p" THEN "Pink"
         WHEN capcolor = "u" THEN "Purple"
         WHEN capcolor = "e" THEN "Red"
         WHEN capcolor = "w" THEN "White"
         ELSE "Yellow"
       END AS ColorOfCap
from MushroomData
group by capcolor, class
order by count(capcolor) desc;
Mushroom Odor and Quantity
%sql
select count(odor),
       CASE
         WHEN odor = "a" THEN "almond"
         WHEN odor = "l" THEN "anise"
         WHEN odor = "c" THEN "creosote"
         WHEN odor = "y" THEN "fishy"
         WHEN odor = "f" THEN "foul"
         WHEN odor = "m" THEN "musty"
         WHEN odor = "n" THEN "none"
         WHEN odor = "p" THEN "pungent"
         ELSE "spicy"
       END AS odor
from MushroomData
group by odor
order by count(odor) desc;
Edible and Poisonous Mushrooms Based on Odor
%sql
select count(odor),
       CASE WHEN class = "e" THEN "Edible" ELSE "Poisonous" END AS CLASSES,
       CASE
         WHEN odor = "a" THEN "almond"
         WHEN odor = "l" THEN "anise"
         WHEN odor = "c" THEN "creosote"
         WHEN odor = "y" THEN "fishy"
         WHEN odor = "f" THEN "foul"
         WHEN odor = "m" THEN "musty"
         WHEN odor = "n" THEN "none"
         WHEN odor = "p" THEN "pungent"
         ELSE "spicy"
       END AS odor
from MushroomData
group by odor, class
order by count(odor) desc;
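As an alternative to the grouped query above, the same edible versus poisonous breakdown can be produced as a contingency table with the DataFrame API. The following is a minimal sketch using Spark's built-in `stat.crosstab`; the wrapper name `classBreakdown` is hypothetical, and it assumes the label column is called class, as in the schema printed earlier.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: one row per code of `feature`, one count column per class code.
// The first column of the result is named "<feature>_class", e.g. "odor_class".
def classBreakdown(df: DataFrame, feature: String): DataFrame =
  df.stat.crosstab(feature, "class")
```

For example, `classBreakdown(mushroom, "odor").show()` would print one row per odor code with separate e and p count columns, which makes it easy to spot codes that occur in only one class.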
Mushroom Population Type Percentage
%sql
select count(population),
       CASE
         WHEN population = "a" THEN "abundant"
         WHEN population = "c" THEN "clustered"
         WHEN population = "n" THEN "numerous"
         WHEN population = "s" THEN "scattered"
         WHEN population = "v" THEN "several"
         ELSE "solitary"
       END AS Population
from MushroomData
group by Population;
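The query above returns raw counts per population type; the percentages named in the heading presumably come from the notebook's chart. To compute the shares directly, each group count can be divided by the total row count. The following is a minimal sketch with the DataFrame API; `percentageBy` is a hypothetical helper, not part of the original notebook.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, round}

// Hypothetical helper: count and percentage share of each group, largest group first.
def percentageBy(df: DataFrame, columns: String*): DataFrame = {
  val total = df.count() // total number of rows, used as the denominator
  df.groupBy(columns.map(col): _*)
    .count()
    .withColumn("percent", round(col("count") * 100.0 / total, 2))
    .orderBy(col("count").desc)
}
```

`percentageBy(mushroom, "population").show()` would give the share of each population code, and `percentageBy(mushroom, "population", "class")` would split those shares by edible and poisonous.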
Edible & Poisonous Mushroom Population Type Percentage
%sql
select count(population),
       CASE WHEN class = "e" THEN "Edible" ELSE "Poisonous" END AS CLASSES,
       CASE
         WHEN population = "a" THEN "abundant"
         WHEN population = "c" THEN "clustered"
         WHEN population = "n" THEN "numerous"
         WHEN population = "s" THEN "scattered"
         WHEN population = "v" THEN "several"
         ELSE "solitary"
       END AS Population
from MushroomData
group by Population, class;
Mushroom Habitat Type Percentage
%sql
select count(habitat),
       CASE
         WHEN habitat = "g" THEN "grasses"
         WHEN habitat = "l" THEN "leaves"
         WHEN habitat = "m" THEN "meadows"
         WHEN habitat = "p" THEN "paths"
         WHEN habitat = "u" THEN "urban"
         WHEN habitat = "w" THEN "waste"
         ELSE "woods"
       END AS Habitat
from MushroomData
group by habitat;
Edible & Poisonous Mushroom Habitat Type Percentage
%sql
select count(habitat),
       CASE WHEN class = "e" THEN "Edible" ELSE "Poisonous" END AS CLASSES,
       CASE
         WHEN habitat = "g" THEN "grasses"
         WHEN habitat = "l" THEN "leaves"
         WHEN habitat = "m" THEN "meadows"
         WHEN habitat = "p" THEN "paths"
         WHEN habitat = "u" THEN "urban"
         WHEN habitat = "w" THEN "waste"
         ELSE "woods"
       END AS Habitat
from MushroomData
group by habitat, class;