Introduction to Machine Learning
What is Machine Learning?
Machine learning is an approach in which a machine learns patterns from data by itself, without explicit step-by-step direction from the user. We typically use machine learning:
->When working on complex tasks where a deterministic solution is not sufficient
Example: Recognizing speech / images
->When building a personalization system
Example: recommendation systems
->When working on tasks that are difficult to track
Example: automated driving, fraud detection
Implementation
We will use a case study to learn machine learning. Let’s find out …
In this case study, our goals are to load, analyze, clean, and model the data. To do that, we need several libraries: NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn. Install any of these libraries that you have not used before (using pip install ___ ), then import all the libraries we use.
# 1. Import NumPy, Pandas, Seaborn, Matplotlib, LabelEncoder from sklearn.preprocessing, train_test_split from sklearn.model_selection, and classification_report and confusion_matrix from sklearn.metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
#from sklearn.cluster import KMeans
Data Exploration
Any machine learning project starts with data exploration. It helps us understand the patterns in the data and gives us the insight we need to choose the preprocessing steps and the modelling technique. These are the steps for exploring the dataset:
#access google drive
from google.colab import drive
drive.mount('/content/drive')
# Import Dataset Mushroom from google drive (your own google drive)
df=pd.read_csv("/content/drive/MyDrive/Tugas DS3/mushrooms.csv")
df
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 8124 non-null object
1 cap-shape 8124 non-null object
2 cap-surface 8124 non-null object
3 cap-color 8124 non-null object
4 bruises 8124 non-null object
5 odor 8124 non-null object
6 gill-attachment 8124 non-null object
7 gill-spacing 8124 non-null object
8 gill-size 8124 non-null object
9 gill-color 8124 non-null object
10 stalk-shape 8124 non-null object
11 stalk-root 8124 non-null object
12 stalk-surface-above-ring 8124 non-null object
13 stalk-surface-below-ring 8124 non-null object
14 stalk-color-above-ring 8124 non-null object
15 stalk-color-below-ring 8124 non-null object
16 veil-type 8124 non-null object
17 veil-color 8124 non-null object
18 ring-number 8124 non-null object
19 ring-type 8124 non-null object
20 spore-print-color 8124 non-null object
21 population 8124 non-null object
22 habitat 8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
The data info gives us some insight. Every column is categorical (dtype object), and there are no null values in any column, so the data looks clean.
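The null check can also be done directly rather than read off df.info(). Below is a minimal sketch on a small made-up frame (the toy values and the "?" placeholder are assumptions for illustration, not rows from the mushroom file). It also counts a placeholder string, because isnull() only catches real missing values and would not flag a marker such as "?".

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", None, "n"],
    "stalk-root": ["e", "?", "b", "?"],
})

# Count real missing values (None/NaN) per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Count a placeholder string per column; "?" is a hypothetical marker here.
placeholder_per_column = (toy == "?").sum()
print(placeholder_per_column)
```

If the placeholder count is non-zero for a column, it may be worth deciding how to treat those values before encoding.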
# Plot the value counts of each column to see what we can do with the dataset
for i in df.columns:
    sns.countplot(data=df, x=i)
    plt.show()
#check unique value
df.nunique()
class 2
cap-shape 6
cap-surface 4
cap-color 10
bruises 2
odor 9
gill-attachment 2
gill-spacing 2
gill-size 2
gill-color 12
stalk-shape 2
stalk-root 5
stalk-surface-above-ring 4
stalk-surface-below-ring 4
stalk-color-above-ring 9
stalk-color-below-ring 9
veil-type 1
veil-color 4
ring-number 3
ring-type 5
spore-print-color 9
population 6
habitat 7
dtype: int64
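In the output above, veil-type has only one unique value, so it cannot help a model tell the classes apart. The article keeps the column, but as a hedged sketch, constant columns could be found and dropped like this (the toy frame is made up for illustration):

```python
import pandas as pd

# Toy frame: "veil-type" is constant, "odor" varies (values are illustrative).
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["a", "n", "p", "n"],
})

# Columns with exactly one unique value carry no information for a classifier.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)

# Drop them and keep the rest.
reduced = toy.drop(columns=constant_columns)
print(reduced.columns.tolist())
```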
Select the dependent and independent variables. The class column is the dependent variable and is stored in y. All the other columns are the independent variables and are stored in x.
x=df.drop('class',axis=1)
y=df['class']
x.head()
Use the LabelEncoder to convert the categorical data into numerical data. We need to do this because the models used here can only be trained on numerical data.
Encoder_X = LabelEncoder()
for col in x.columns:
    x[col] = Encoder_X.fit_transform(x[col])
Encoder_y=LabelEncoder()
y = Encoder_y.fit_transform(y)
x
y
array([1, 0, 0, …, 0, 1, 0])
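LabelEncoder assigns integers in the sorted order of the labels, and a fitted encoder can map the integers back to the original labels. The loop above refits a single encoder for every column, so only the last column's mapping survives. A minimal sketch that keeps one encoder per column is shown below; the toy frame and the `encoders` dictionary are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two categorical columns (values are illustrative).
toy = pd.DataFrame({
    "odor": ["p", "a", "n"],
    "bruises": ["t", "f", "t"],
})

# Keep one fitted encoder per column so every mapping can be inverted later.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the labels in the order of their integer codes.
print(list(encoders["odor"].classes_))
print(encoded["odor"].tolist())

# Map the integers back to the original labels.
original_odor = encoders["odor"].inverse_transform(encoded["odor"])
print(list(original_odor))
```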
Split the data into training and testing sets with a ratio of 80:20
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
X_train
y_train
array([1, 1, 1, …, 0, 1, 0])
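To see what an 80:20 split means for a dataset of this size, here is a small sketch that uses placeholder arrays with the same number of rows as reported by df.info() (8124). The arrays `X_demo` and `y_demo` are made up; only the row count comes from the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8124  # row count reported by df.info() above

# Placeholder feature and label arrays, used only to check the split sizes.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The test size is rounded up, so 20% of 8124 rows gives 1625 test rows.
print(len(X_tr), len(X_te))
```

The 1625 test rows match the total support shown in the classification reports. Passing `stratify=y_demo` is an option when the class proportions should be preserved in both sets.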
Create a machine learning model using a Random Forest, and train a Logistic Regression model for comparison
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train,y_train)
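Because the forest is created with oob_score=True, scikit-learn also computes an out-of-bag accuracy estimate during fitting, which can be read from the `oob_score_` attribute. A minimal sketch on synthetic data follows (the dataset from `make_classification` and the name `rf_demo` are assumptions for illustration, not the mushroom data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only to demonstrate the attribute.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True scores each sample using only the trees that did not see it.
rf_demo = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Accuracy estimated on out-of-bag samples, available without a separate test set.
print(rf_demo.oob_score_)
```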
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(solver='lbfgs', max_iter=400)
LR.fit(X_train, y_train)
y_pred_LR = LR.predict(X_test)
print(classification_report(y_test,y_pred_LR))
              precision    recall  f1-score   support

           0       0.95      0.95      0.95       843
           1       0.94      0.95      0.95       782

    accuracy                           0.95      1625
   macro avg       0.95      0.95      0.95      1625
weighted avg       0.95      0.95      0.95      1625
Predict on the test data with the Random Forest model
y_pred=rf.predict(X_test)
y_pred
array([0, 1, 1, …, 1, 1, 1])
Next, evaluate the model using the confusion matrix and classification report
sns.heatmap(confusion_matrix(y_test, y_pred),annot=True);
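To read the heatmap, it helps to know the layout of the matrix: rows are the true classes and columns are the predicted classes. A small sketch on made-up labels (not the mushroom predictions) shows how the four cells map to true negatives, false positives, false negatives, and true positives:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a binary problem.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```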
Classification Report
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       843
           1       1.00      1.00      1.00       782

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625
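The columns of the report can be reproduced by hand. As a hedged sketch on made-up labels (not the mushroom predictions), precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and the F1 score is their harmonic mean:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 2 true positives, 1 false positive, 2 false negatives.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of the two

print(round(precision, 2), round(recall, 2), round(f1, 2))
```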
Data and Code
Thanks to MySkill, Ronny Fahrudin, Priagung Khusumanegara, and all my mentors for helping me improve my skills.