In this article, we will see how to process a Sensex log (share market data) in PDF format using Big Data technology, walking through the project execution step by step.

Problem Statement: Analyse the data in the Hadoop ecosystem to:

- Take the complete PDF input data on HDFS.
- Develop a MapReduce use case to get the below filtered results from the HDFS input data (Excel data):

If TYPE OF TRADING is 'SIP':
- OPEN_BALANCE > 25000 and FLTUATION_RATE > 10 --> store in "HighDemandMarket"
- CLOSING_BALANCE < 22000 and FLTUATION_RATE between 20 and 30 --> store in "OnGoingMarketStretegy"

If TYPE OF…
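The routing rules above can be sketched in plain Python. This is only an illustration of the mapper-side logic, not the actual Hadoop MapReduce job; the column names follow the problem statement, and the dictionary keys are hypothetical stand-ins for the parsed record fields:

```python
def classify(record):
    """Route a parsed record to an output bucket per the problem statement.

    record: dict keyed by the dataset's column names. Only the rules shown
    in the excerpt are implemented; rules for other trading types are
    truncated in the original text.
    """
    if record["TYPE_OF_TRADING"] == "SIP":
        if record["OPEN_BALANCE"] > 25000 and record["FLTUATION_RATE"] > 10:
            return "HighDemandMarket"
        if record["CLOSING_BALANCE"] < 22000 and 20 <= record["FLTUATION_RATE"] <= 30:
            return "OnGoingMarketStretegy"
    return None  # remaining rules are elided in the excerpt

# A SIP record with a high open balance and fluctuation rate:
print(classify({"TYPE_OF_TRADING": "SIP", "OPEN_BALANCE": 30000,
                "CLOSING_BALANCE": 26000, "FLTUATION_RATE": 15}))  # HighDemandMarket
```

In the real job, each returned bucket name would correspond to a MultipleOutputs target so the reducer writes files like HighDemandMarket-r-00000, which the Pig script below consumes.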
Apache Pig Script: SENSEX.pig

-- Deduplicate and sort each MapReduce output, then store it as CSV
A = LOAD '/hdfs/bhavesh/SENSEX/OUTPUT/HighDemandMarket-r-00000' USING PigStorage('\t')
    AS (Sid:int, Sname:chararray, Ttrading:chararray, Sloc:chararray, OBal:int, CBal:int, Frate:int);
disHM = DISTINCT A;
orHM = ORDER disHM BY Sid;
STORE orHM INTO '/hdfs/bhavesh/SENSEX/HM' USING PigStorage(',');

A = LOAD '/hdfs/bhavesh/SENSEX/OUTPUT/ReliableProducts-r-00000' USING PigStorage('\t')
    AS (Sid:int, Sname:chararray, Ttrading:chararray, Sloc:chararray, OBal:int, CBal:int, Frate:int);
disRP = DISTINCT A;
orRP = ORDER disRP BY Sid;
STORE orRP INTO '/hdfs/bhavesh/SENSEX/RP' USING PigStorage(',');

A = LOAD '/hdfs/bhavesh/SENSEX/OUTPUT/OtherProducts-r-00000' USING PigStorage('\t')
    AS (Sid:int, Sname:chararray, Ttrading:chararray, Sloc:chararray, OBal:int, CBal:int, Frate:int);
disOP = DISTINCT A;
orOP = ORDER disOP BY Sid;
STORE orOP INTO '/hdfs/bhavesh/SENSEX/OP' USING PigStorage(',');

A = LOAD '/hdfs/bhavesh/SENSEX/OUTPUT/WealthyProducts-r-00000' USING PigStorage('\t')
    AS (Sid:int, Sname:chararray, Ttrading:chararray, Sloc:chararray, OBal:int, CBal:int, Frate:int);
disWP = DISTINCT A;
orWP = ORDER disWP BY Sid;
STORE orWP INTO '/hdfs/bhavesh/SENSEX/WP' USING PigStorage(',');

A…
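Each block of the Pig script performs the same pass: DISTINCT to drop duplicate rows, ORDER BY Sid to sort, and a comma-separated STORE. The equivalent logic, sketched in plain Python over a list of tuples (an illustration only, not Pig execution; the sample rows are made up to match the schema):

```python
# Records shaped like the Pig schema: (Sid, Sname, Ttrading, Sloc, OBal, CBal, Frate)
records = [
    (2, "StockB", "SIP", "Pune", 27000, 21000, 12),
    (1, "StockA", "SIP", "Mumbai", 30000, 25000, 15),
    (2, "StockB", "SIP", "Pune", 27000, 21000, 12),  # duplicate row
]

# DISTINCT removes duplicate rows; ORDER BY Sid sorts on the first field
deduped_sorted = sorted(set(records), key=lambda r: r[0])

# Emit comma-separated lines, like PigStorage(',')
for row in deduped_sorted:
    print(",".join(map(str, row)))
```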
In this article, we explore the sentiments of people in India during demonetization. Even with small data, I could still gain a lot of valuable insights. I have used Spark SQL and the built-in graphs provided by Databricks. India is the second-most populous country in the world, with over 1.271 billion people, more than a sixth of the world's population. Let us find out the views of different people on demonetization by analyzing tweets from Twitter.

Attribute Information or Dataset Details:

Table Created in Databricks Environment

Technology Used:
- Apache Spark
- Spark SQL
- DataFrame-based API
- Databricks Notebook
- Free Account…
In this article, we explore Census data for India to understand changes in India's demographics: population growth, religion distribution, gender distribution, sex ratio, etc. Even with small data, I could still gain a lot of valuable insights about the country. I have used Spark SQL and the built-in graphs provided by Databricks. India is the second-most populous country in the world, with over 1.271 billion people, more than a sixth of the world's population. Already containing 17.5% of the world's population, India is projected to be the world's most populous country by 2025, surpassing China, its population reaching…
Code for Spark SQL to get Population Density in terms of Districts
Code for Spark SQL to get Scheduled Castes (SCs) Population per State
Code for Spark SQL to get What percentage of the states are actually literate in India?
Code for Spark SQL to get States which have a Literacy Rate less than 50%
Code for Spark SQL to get Male and Female Literacy Rate per State
Code for Spark SQL to get Literacy Rate as per Type of Education for every State
Code for Spark SQL to get Male and Female Percentage per State
Code for Spark SQL to…
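As an illustration of the kind of query involved, here is a sketch of the "literacy rate less than 50%" query. Spark SQL is largely ANSI-compatible, so the same SELECT shape runs against an in-memory SQLite table below; the table and column names (census, State, Literacy_Rate) are hypothetical stand-ins for the actual Databricks table:

```python
import sqlite3

# Hypothetical census table; in Databricks this would be a registered temp view
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (State TEXT, Literacy_Rate REAL)")
conn.executemany("INSERT INTO census VALUES (?, ?)",
                 [("StateA", 47.5), ("StateB", 82.3), ("StateC", 49.1)])

# The same shape of query you would run via spark.sql(...) in a notebook cell
query = "SELECT State, Literacy_Rate FROM census WHERE Literacy_Rate < 50 ORDER BY Literacy_Rate"
low_literacy = conn.execute(query).fetchall()
print(low_literacy)  # [('StateA', 47.5), ('StateC', 49.1)]
```

In a Databricks notebook the result of `spark.sql(query)` can be passed straight to `display()` to use the built-in charts mentioned above.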
Code for Spark SQL to get Status of Electricity Facility by State
Code for Spark SQL to get Education Facility in India per State
Code for Spark SQL to get Medical Facility in India
Code for Spark SQL to get Bus Transportation per State
Code for Spark SQL to get Road Status in India
Code for Spark SQL to get Residence Status in India by State
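The facility-by-state queries above are per-state aggregations. A sketch of the electricity-status query, again using an in-memory SQLite table so it is runnable here (the table and column names villages, State, Has_Electricity are assumptions; in Databricks the identical GROUP BY would run through spark.sql):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE villages (State TEXT, Has_Electricity INTEGER)")
conn.executemany("INSERT INTO villages VALUES (?, ?)",
                 [("StateA", 1), ("StateA", 0), ("StateB", 1), ("StateB", 1)])

# Count electrified villages per state -- the same GROUP BY works in Spark SQL
query = """
SELECT State, SUM(Has_Electricity) AS electrified, COUNT(*) AS total
FROM villages GROUP BY State ORDER BY State
"""
result = conn.execute(query).fetchall()
print(result)  # [('StateA', 1, 2), ('StateB', 2, 2)]
```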
Click on Create a Blank Notebook as shown in the image below. Specify the file name and select the cluster we created earlier. A notebook is a collection of runnable cells (commands); when you use a notebook, you are primarily developing and running cells. The supported magic commands are: %python, %r, %scala, and %sql. Additionally:
%sh: allows you to execute shell code in your notebook.
%fs: allows you to use dbutils filesystem commands.
%md: allows you to include various types of documentation, including text, images, and mathematical formulas and equations.
For more details, please refer to the Databricks documentation.
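For example, a %sh cell simply runs ordinary shell commands on the cluster's driver node. A minimal sketch of such a cell (the commands are illustrative only):

```shell
# Contents of a %sh notebook cell (the %sh line is the first line of the cell)
# Inspect the working directory and available disk space on the driver
pwd
df -h | head -n 3
```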
What is the Databricks Community Edition? The Databricks Community Edition is the free version of Databricks' cloud-based big data platform. It allows users to access a micro-cluster as well as a cluster manager and notebook environment. All users can share their notebooks and host them free of charge with Databricks. Link for Databricks Community Edition: https://community.cloud.databricks.com/login.html Open the above link in any up-to-date browser; we recommend Google Chrome for the best experience. Click on Sign up as shown in the image. A new page opens as shown in the image below. Fill in all the required details as applicable…
Once you log in to Databricks Community Edition, the left tab has a Clusters button as shown in the image; click on it. As soon as you click the Clusters button, a new web page opens as shown in the image below. As soon as you click Create Cluster, another web page opens as shown in the image below. The steps to launch a Spark cluster are as follows: specify the cluster name [you can specify any cluster name; for all our projects we will use SparkCluster], then click on Create Cluster. Please make a note: Free 15GB Memory:…
Loading Data into Databricks: Click on Import and Explore Data. A popup opens; select the file you want to upload into Databricks. Once you click on Drop files, a new popup opens, then a new web page opens and the file is uploaded into the Databricks environment. Make sure you see the tick mark, which indicates the file was uploaded successfully; copy the file location and refer to this file in your notebook.
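Uploaded files land under /FileStore/tables/, and the copied location is what you pass to the reader in a notebook cell. As a minimal illustration of parsing such a comma-separated file, here is a plain-Python sketch over an in-memory sample (the file name and column names are hypothetical):

```python
import csv
import io

# In a Databricks notebook you would read the uploaded file with, e.g.:
#   df = spark.read.csv("/FileStore/tables/<your_file>.csv", header=True)
# Here we parse an equivalent in-memory CSV sample with the stdlib csv module.
sample = io.StringIO("State,Population\nStateA,1000\nStateB,2500\n")
reader = csv.DictReader(sample)
rows = list(reader)
print(rows[0]["State"], rows[0]["Population"])  # StateA 1000
```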