Introduction to Machine Learning

5 min read · Feb 6, 2023

What is Machine Learning?

Machine learning is a way of developing a machine that can learn by itself from data, without explicit user direction. We use machine learning:

-> When working on complex tasks where a deterministic solution is not sufficient
Example: recognizing speech or images
-> When building a personalization system
Example: recommendations and personalization
-> When working on tasks that are difficult to track
Example: automated driving, fraud detection

Implementation

We will use a case study to learn machine learning. Let’s find out …

In this case, our goals are to load, analyze, clean, and model the data. So we need to prepare some libraries: NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn (sklearn). If you have never used one of these libraries, install it first (using pip install ___ ); after that, we can import all the libraries we use.
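Before importing, it can help to check which libraries are already installed. The snippet below is a minimal sketch using only the standard library; the list name `required` is an illustrative choice, not part of the original notebook. Note that scikit-learn is imported as `sklearn` but installed with `pip install scikit-learn`.

```python
import importlib.util

# Import names of the libraries used in this article
# (scikit-learn is imported under the name "sklearn").
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a top-level module cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All libraries are available.")
```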

# 1. Import NumPy, Pandas, Seaborn, Matplotlib, LabelEncoder from sklearn.preprocessing, train_test_split from sklearn.model_selection, and classification_report and confusion_matrix from sklearn.metrics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
#from sklearn.cluster import KMeans

Data Exploration

If we want to build a machine learning model, we cannot skip data exploration. It helps us understand the patterns in the data and gives us insight for deciding on the data preprocessing steps and the machine learning technique. These are some steps for exploring the dataset:

#access google drive
from google.colab import drive
drive.mount('/content/drive')

# Import Dataset Mushroom from google drive (your own google drive)
df=pd.read_csv("/content/drive/MyDrive/Tugas DS3/mushrooms.csv")
df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring    8124 non-null   object
 15  stalk-color-below-ring    8124 non-null   object
 16  veil-type                 8124 non-null   object
 17  veil-color                8124 non-null   object
 18  ring-number               8124 non-null   object
 19  ring-type                 8124 non-null   object
 20  spore-print-color         8124 non-null   object
 21  population                8124 non-null   object
 22  habitat                   8124 non-null   object
dtypes: object(23)
memory usage: 1.4+ MB

We get some insight from the data info: all 23 columns are categorical (dtype object). We can see the data type of each column, and there are no null values (it’s clean, bro).
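One caveat: df.info() only counts real NaN values. Some datasets mark missing entries with a placeholder string such as "?" instead. Whether mushrooms.csv does this is an assumption worth verifying on your own copy. The sketch below shows the check on a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the real file).

```python
import pandas as pd

# Toy stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "stalk-root": ["e", "c", "?", "e"],
})

# Real NaN values per column (this is what df.info() reflects).
null_counts = toy.isnull().sum()

# Cells that hold the assumed placeholder "?" instead of NaN.
placeholder_counts = (toy == "?").sum()

print(null_counts)
print(placeholder_counts)
```

On the real data, the same two lines can be run with df in place of toy.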

# plot the value counts of every column to get insight into what we can do with the dataset
for i in df.columns:
    sns.countplot(data=df,x=i)
    plt.show()

# count the unique values in each column
df.nunique()

class                        2
cap-shape                    6
cap-surface                  4
cap-color                   10
bruises                      2
odor                         9
gill-attachment              2
gill-spacing                 2
gill-size                    2
gill-color                  12
stalk-shape                  2
stalk-root                   5
stalk-surface-above-ring     4
stalk-surface-below-ring     4
stalk-color-above-ring       9
stalk-color-below-ring       9
veil-type                    1
veil-color                   4
ring-number                  3
ring-type                    5
spore-print-color            9
population                   6
habitat                      7
dtype: int64
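In the output above, veil-type has only 1 unique value, so it is the same for every row and cannot help separate the classes. The article keeps the column, which is harmless. As an optional step, constant columns could be found and dropped as in this sketch (the toy data and the names `constant_cols` and `reduced` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for df with one constant column.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["p", "a", "l", "n"],
})

# Columns with exactly one unique value carry no information.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)

# Drop them before modelling (optional).
reduced = toy.drop(columns=constant_cols)
print(reduced.columns.tolist())
```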

Select the dependent and independent variables. The class column is the dependent variable (the target) and is stored in y; all remaining columns are the independent variables (the features) and are stored in x.

x=df.drop('class',axis=1)
y=df['class'] 
x.head()

Use the LabelEncoder to convert the categorical data into numerical data. We must do this because the machine learning models used here can only be trained on numerical data.

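# One LabelEncoder is refit for every feature column, mapping each category to an integer
Encoder_X = LabelEncoder()
for col in x.columns:
    x[col] = Encoder_X.fit_transform(x[col])
# A separate encoder for the target keeps the class mapping in Encoder_y.classes_
Encoder_y = LabelEncoder()
y = Encoder_y.fit_transform(y)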

array([1, 0, 0, …, 0, 1, 0])
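To see which original label became 0 and which became 1, the fitted encoder's classes_ attribute can be inspected. The sketch below uses made-up toy values (the letters "e" and "p" and all toy_ names are assumptions for illustration). It also shows OrdinalEncoder, which scikit-learn provides for encoding feature columns in one step; LabelEncoder is documented as intended for the target. This is an alternative, not what the article's notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy target and features; the values are made up.
toy_y = pd.Series(["p", "e", "e", "p"], name="class")
toy_x = pd.DataFrame({
    "odor": ["p", "a", "l", "a"],
    "bruises": ["t", "t", "f", "f"],
})

# Target: classes_ lists the labels in sorted order, index = encoded value.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(toy_y)
print(label_encoder.classes_)  # ['e' 'p'] -> 'e' becomes 0, 'p' becomes 1
print(y_encoded)               # [1 0 0 1]

# Features: OrdinalEncoder handles all columns at once and
# remembers the mapping of each column in categories_.
feature_encoder = OrdinalEncoder()
x_encoded = feature_encoder.fit_transform(toy_x)
print(feature_encoder.categories_)
print(x_encoded)
```

Both encoders assign integers in sorted order, so the numbers carry no real ranking. Tree-based models such as Random Forest usually cope with that; for linear models, one-hot encoding is a common alternative.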

Split the data into training and testing sets with a ratio of 80:20

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
X_train

y_train

array([1, 1, 1, …, 0, 1, 0])
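With 8124 rows, test_size=0.2 rounds up to 1625 test rows, which matches the support totals in the classification reports. The sketch below shows the same behaviour on a tiny made-up array. The stratify argument is an optional addition (not used in the article) that keeps the class ratio the same in both sets, and a fixed random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features, balanced binary labels.
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(X_tr), len(X_te))  # 8 2

# Same random_state -> exactly the same split again.
X_tr2, X_te2, _, _ = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)
print(np.array_equal(X_te, X_te2))  # True
```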

Create a machine learning model using a Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train,y_train)
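The model above is created with oob_score=True, but the score is never read. With that option, each tree is evaluated on the training rows it did not see during bootstrapping, and the result is stored in the oob_score_ attribute after fitting. A minimal sketch on synthetic data (make_classification and the _demo names are stand-ins, not the mushroom data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

rf_demo = RandomForestClassifier(oob_score=True, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Accuracy estimated on out-of-bag samples, no separate test set needed.
print(round(rf_demo.oob_score_, 3))
```

In the article's notebook, the equivalent would be reading rf.oob_score_ after rf.fit(X_train, y_train).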

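# Also train a Logistic Regression model on the same split
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(solver='lbfgs', max_iter=400)
LR.fit(X_train, y_train)

# Predict on the test data and print the report for Logistic Regression
y_pred_LR = LR.predict(X_test)

print(classification_report(y_test,y_pred_LR))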

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       843
           1       0.94      0.95      0.95       782

    accuracy                           0.95      1625
   macro avg       0.95      0.95      0.95      1625
weighted avg       0.95      0.95      0.95      1625

Run the Random Forest model on the test data

y_pred=rf.predict(X_test)
y_pred

array([0, 1, 1, …, 1, 1, 1])

Next, evaluate the model using the confusion matrix and classification report

# fmt='d' prints the counts as whole numbers instead of scientific notation
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d');
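The numbers in the classification report come directly from the confusion matrix. In scikit-learn, rows are the true classes and columns are the predicted classes. The sketch below recomputes precision and recall for class 1 by hand on made-up labels (y_true and y_hat are toy values, not the article's results) and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 4 samples of class 0 and 6 samples of class 1.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_hat)
print(cm)  # [[3 1]
           #  [1 5]]

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # of everything predicted 1, how much is truly 1
recall_1 = tp / (tp + fn)     # of everything truly 1, how much was found

print(precision_1, precision_score(y_true, y_hat))
print(recall_1, recall_score(y_true, y_hat))
```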

Classification Report

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       843
           1       1.00      1.00      1.00       782

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625
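A perfect score on a single 80:20 split is worth double-checking on several different splits. One common way is k-fold cross-validation. The sketch below is an optional extra step on synthetic data (make_classification and the _demo names are stand-ins, so the printed scores say nothing about the mushroom data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# 5-fold cross-validation: train on 4 folds, score on the 5th, repeat.
scores = cross_val_score(RandomForestClassifier(random_state=42), X_demo, y_demo, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

If the scores stay close to each other across folds, the result does not depend on one lucky split.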

Data and Code

Google Colaboratory

colab.research.google.com

Thanks to: MySkill, Ronny Fahrudin, Priagung Khusumanegara, and all my mentors for helping me improve my skills.