
Solving a Classification Problem



GENERAL CONFEDERATION OF LABOR OF VIETNAM

<b>TON DUC THANG UNIVERSITY</b>

<b>FACULTY OF INFORMATION TECHNOLOGY</b>

INTRODUCTION TO MACHINE LEARNING


We also thank Ton Duc Thang University for giving us a modern and well-developed educational environment.

With hard work and effort, we have successfully completed this report. Even so, it surely cannot avoid all mistakes. We look forward to receiving feedback from our teacher so that we can improve it further.

We sincerely thank you!


In addition, many comments and assessments, as well as data from other authors and organizations, have been used in the project, with references and annotations.

<b>If any fraud is found, I am fully responsible for the content of my project.</b>

Ton Duc Thang University is not responsible for any copyright infringement committed in the course of implementation (if any).

<i>Ho Chi Minh City, October 16, 2022</i>

<i>Author</i>

<i>(sign and write full name)</i>

<i>Phạm Ngọc Tuân</i>

<i>Dương Hòa Mạnh</i>


<b>EVALUATION OF INSTRUCTING LECTURER</b>

<b>Confirmation of the instructor</b>

<small>Ho Chi Minh City, 2022 (sign and write full name)</small>

<b>The assessment of the grading lecturer</b>

<small>Ho Chi Minh City, 2022 (sign and write full name)</small>


<b>Table of contents</b>


<b>I. SOLVING A CLASSIFICATION PROBLEM</b>

<b>1.1 About the Data </b>

<b>1.2 Introduction to the Data</b>

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

<b>1.3 Read and standardize data</b>

<b>1.3.1 Data reading</b>
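The reading step itself is not reproduced above; as a minimal sketch using only the Python standard library, it could look like the following. The column names and sample values here are assumptions based on the dataset description (gender, age, diseases, smoking status), not the actual file:

```python
import csv
import io

# A few sample rows shaped like the stroke dataset; the values are invented
# for illustration only.
SAMPLE = """gender,age,hypertension,bmi,smoking_status,stroke
Male,67,0,36.6,formerly smoked,1
Female,61,0,,never smoked,1
Male,80,1,32.5,never smoked,0
"""

def read_rows(text):
    """Parse CSV text into dicts, converting numeric fields; '' becomes None."""
    rows = []
    for r in csv.DictReader(io.StringIO(text)):
        r["age"] = float(r["age"])
        r["hypertension"] = int(r["hypertension"])
        r["bmi"] = float(r["bmi"]) if r["bmi"] else None
        r["stroke"] = int(r["stroke"])
        rows.append(r)
    return rows

def fill_missing_bmi(rows):
    """Standardize: replace missing bmi values with the column mean."""
    known = [r["bmi"] for r in rows if r["bmi"] is not None]
    mean_bmi = sum(known) / len(known)
    for r in rows:
        if r["bmi"] is None:
            r["bmi"] = mean_bmi
    return rows

rows = fill_missing_bmi(read_rows(SAMPLE))
```

In practice the same file would be opened with `open(...)` instead of an in-memory string, and a library such as pandas would usually do this work in one call.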


+ Naive Bayes Classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.

+ It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.

+ Some popular applications of the Naive Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Bayes’ Theorem
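The theorem named above can be stated as follows (a standard formulation, added here for reference):

```latex
% Bayes' theorem: the posterior probability of class c given features x
% equals the likelihood times the prior, normalized by the evidence.
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
```

The "naive" assumption is that the features of x are conditionally independent given the class, so the likelihood P(x | c) factorizes into a product of per-feature terms.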

- Advantages of Naive Bayes Classifier:

<small>o</small> Naive Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

<small>o</small> It can be used for Binary as well as Multi-class Classifications.

<small>o</small> It performs well in multi-class predictions as compared to the other algorithms.

<small>o</small> It is the most popular choice for text classification problems.

- Disadvantages of Naive Bayes Classifier:

<small>o</small> Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
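As an illustration of the probabilistic prediction described above, a minimal Gaussian Naive Bayes can be written from scratch. This is a sketch, not the report's actual implementation; the example data points are invented:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature mean/variance."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    stats, n = {}, len(y)
    for c, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(col) / len(col) for col in cols]
        # Small constant added so a zero variance never divides by zero.
        vars_ = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                 for col, m in zip(cols, means)]
        stats[c] = (len(rows) / n, means, vars_)
    return stats

def predict_gaussian_nb(stats, x):
    """Pick the class with the highest log-posterior (log prior + log likelihood)."""
    best, best_lp = None, -math.inf
    for c, (prior, means, vars_) in stats.items():
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Two well-separated clusters, purely for illustration.
model = fit_gaussian_nb([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]],
                        [0, 0, 1, 1])
```

Working in log space avoids multiplying many small probabilities together, which would underflow on real datasets.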

<b>* Decision tree</b>

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting recursively in a process called recursive partitioning.
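The flowchart structure just described can be illustrated with a tiny hand-built tree. The features and thresholds below are invented for illustration, not learned from the stroke data:

```python
# Internal nodes test a feature against a threshold; leaves hold the class.
TREE = {
    "feature": "age", "threshold": 60,
    "left":  {"leaf": 0},                        # age <= 60 -> predict 0
    "right": {"feature": "bmi", "threshold": 30, # age > 60 -> test bmi next
              "left":  {"leaf": 0},
              "right": {"leaf": 1}},             # age > 60 and bmi > 30 -> 1
}

def predict(node, sample):
    """Follow decision rules from the root until a leaf is reached."""
    if "leaf" in node:
        return node["leaf"]
    branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
    return predict(node[branch], sample)
```

A learning algorithm such as CART would choose the features and thresholds automatically by maximizing a purity measure at each split; here they are fixed by hand to show the structure only.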

<b>- Advantages of Decision Tree</b>

<small>o</small> Highly Interpretable & can be visualized

<small>o</small> Minimal data preprocessing - missing data handling, normalizing, and one-hot encoding are not required

<small>o</small> Handles both numerical & categorical values

<small>o</small> Supports multi-output

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

<b>- Disadvantages of Decision Tree</b>

<small>o</small> Overfitting - the tree keeps growing in height as more data is added

<small>o</small> Slight changes in data or order of data can change the tree

<small>o</small> Imbalanced class datasets create a biased tree, so the data needs balancing

<b>* Random forest</b>

Random Forest is a popular machine learning algorithm that belongs to the family of supervised learning techniques. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.
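The two ingredients described above, random subsets of the data and majority voting, can be sketched as follows (an illustrative fragment, not the report's implementation; a full random forest would also train a decision tree on each bootstrap sample):

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw a random subset of the data with replacement (same size as X)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def majority_vote(predictions):
    """Combine the individual trees' predictions into one final output."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, if three trees predict 1 and two predict 0 for a sample, `majority_vote([1, 0, 1, 1, 0])` returns 1 as the forest's final output.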

<b>- Advantages of random forest</b>

<small>o</small> Random Forest is capable of performing both Classification and Regression tasks.

<small>o</small> It is capable of handling large datasets with high dimensionality.

<small>o</small> It enhances the accuracy of the model and prevents the overfitting issue.

<b>- Disadvantages of random forest</b>

<small>o</small> Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

<b>* K-Nearest neighbor</b>

K-nearest neighbor (KNN) is one of the simplest supervised-learning algorithms in Machine Learning, and it works well in some cases. During training, this algorithm "does not learn" anything from the training data (which is also why it is classified as "lazy learning"); all calculations are performed when it needs to predict the outcome of new data. K-nearest neighbor can be applied to both types of supervised learning problem, Classification and Regression. KNN is also known as an Instance-based or Memory-based learning algorithm.


<b>- Advantages of K-Nearest neighbor</b>

<small>o</small> Can be used for both regression and classification problems

<small>o</small> Can be used easily with multiclass datasets

<b>- Disadvantages of K-Nearest neighbor</b>

<small>o</small> It suffers from skewed class distributions, meaning that if a specific class occurs frequently in the training set, it is likely to dominate the majority voting for a new example

<small>o</small> Cannot work if there are any missing values
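Since all of KNN's work happens at prediction time, a minimal sketch needs only a distance computation and a majority vote over the k closest training points (illustrative standard-library Python, not the report's implementation):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Nothing is learned up front: at prediction time, sort the training
    points by Euclidean distance to x and vote over the k nearest labels."""
    dists = sorted(
        (math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)
    )
    top_k = [yi for _, yi in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]
```

Note that every training point must be kept in memory and scanned for each prediction, which is exactly why KNN is called an Instance-based or Memory-based algorithm.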

<b>* Logistic regression</b>

Logistic regression is a classification algorithm used to find the probability of event success and event failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. It supports categorizing data into discrete classes by studying the relationship in a given set of labelled data. It learns a linear relationship from the given dataset and then introduces a non-linearity in the form of the Sigmoid function.

<b>- Advantages of logistic regression</b>

<small>o</small> Logistic regression is easy to implement and interpret, and very efficient to train.

<small>o</small> It extends easily to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.

<small>o</small> Its model coefficients can be interpreted as indicators of feature importance.

<b>- Disadvantages of logistic regression</b>


<small>o</small> If the number of observations is smaller than the number of features, Logistic Regression should not be used; otherwise, it may lead to overfitting.

<small>o</small> It constructs linear boundaries.

<small>o</small> It can only be used to predict discrete functions. Hence, the dependent variable of Logistic Regression is bound to the discrete number set.
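The linear relationship plus Sigmoid described above can be sketched with plain per-example gradient descent on the log-loss. This is an illustrative fragment only; the learning rate, epoch count, and example data are arbitrary choices, not values from the report:

```python
import math

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with per-example gradient descent steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_logreg(w, b, x):
    """Threshold the probability at 0.5 to get a discrete class."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)
```

Because the decision is made by thresholding a linear score, the resulting class boundary is linear, which is exactly the limitation listed in the disadvantages above.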

