
Team 4 Group Project: Iris Flower Classification


<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">

<b>MINISTRY OF EDUCATION AND TRAINING</b>

<b>DUY TAN UNIVERSITY</b>

<b>FACULTY OF INTERNATIONAL TRAINING</b>

<b>ARTIFICIAL INTELLIGENCE (FOR BUSINESS)</b>

<b>INSTRUCTOR: DR. Soon Goo Hong</b>

<b>CLASS: IS-CS 468 AIS</b>

<b>Term Group Project</b>

<b>“IRIS FLOWER CLASSIFICATION”</b>

<b> Team 4</b>

Phan Van Minh Manh - 27211445925

Tran Thi Thu Hong - 27201401792

Doan Thien Nhan - 27211201936

</div><span class="text_page_counter">Trang 2</span><div class="page_container" data-page="2">

Da Nang, 12 December, 2023

</div><span class="text_page_counter">Trang 4</span><div class="page_container" data-page="4">

<b>1. Overview</b>

The Iris Flower Classification problem is to distinguish three iris species based on four features: sepal length, sepal width, petal length, and petal width. The problem matters because it has several practical uses, including plant breeding, horticulture, and environmental monitoring.

This report presents our analysis of the Iris Flower dataset using three classification algorithms: <b>K-Nearest Neighbors (KNN)</b>, <b>Decision Tree</b>, and <b>Logistic Regression</b>. We compare the performance of these algorithms using accuracy, recall, precision, and F1 score. We also predict the species of a new instance from its features and make some suggestions for future changes.

<b>1. K-Nearest Neighbors (KNN)</b>

<b>BRIGHTICS PROCESS</b>

<b>DATA LOAD</b>

- Load the KNN data from “sample_iris.csv”.

- We use the sample data provided by Samsung Brightics AI.

- Click the ‘upload’ button, search for ‘sample_iris.csv’, then select it.

- Click the ‘Run’ button.

</div><span class="text_page_counter">Trang 5</span><div class="page_container" data-page="5">

 Group by: species.

 In the output panel, you can see two distinct data sets: "Split data (train_table)" and "Split data (test_table)".
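The split above is grouped by species, so each species keeps the same share in both tables. A minimal Python sketch of such a stratified split follows. The 80/20 ratio, the seed, and the function name `stratified_split` are assumptions for illustration (the ratio is inferred from the 10 test records per species reported in the Decision Tree results), and the Brightics Split Data function may work differently internally.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_ratio=0.2, seed=42):
    """Split rows into train/test so each label keeps the same test share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical stand-in for sample_iris.csv: 50 rows per species.
rows = [{"id": i, "species": s}
        for s in ("setosa", "versicolor", "virginica")
        for i in range(50)]
train_table, test_table = stratified_split(rows, "species")
print(len(train_table), len(test_table))  # 120 30
```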

</div><span class="text_page_counter">Trang 6</span><div class="page_container" data-page="6">

<b>KNN Classification</b>

 Parameter:

 Inputs: click ‘Empty’ in the ‘test_table’.

 Set Inputs: drag ‘test_table’ in the Split Data and drop it to the ‘Drop Data’ in the ‘test_table’.

 Feature Columns: select all.

 Label Columns: species.

 Inputs: default.

 Label Column: ‘species’.

 Prediction Column: ‘prediction’.

</div><span class="text_page_counter">Trang 7</span><div class="page_container" data-page="7">

<b>Comment:</b> The predictions on the test set were perfectly accurate (Accuracy: 1.0). The KNN model classified all three flower species (setosa, versicolor, virginica) with 100% accuracy.

</div><span class="text_page_counter">Trang 8</span><div class="page_container" data-page="8">

Accuracy is the proportion of the total number of predictions that are correct. A high accuracy score means that the model is making correct predictions most of the time.

Precision is the proportion of the model's positive predictions that are actually positive, so the denominator of its formula is TP + FP (true positives plus false positives). A higher precision means that a larger share of the model's positive predictions are correct.
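For reference, the four metrics used in this report can be written for a single class in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These are the standard textbook definitions, not formulas copied from the Brightics output.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```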

With k = 1, 3, or 5, the model achieved an accuracy of 0.967. With k = 7, 9, or 11, the model achieved an accuracy of 1.0.

Accuracy is therefore lower for K = 1, 3, or 5 than for K greater than 5. Very small values of K make the model sensitive to noise in individual training points (overfitting), while very large values over-smooth the decision boundary and risk underfitting. K = 7 is the smallest value at which the four evaluation metrics reach their highest values, so it gives the best test performance without smoothing more than necessary.

Therefore, the best k for this dataset is 7.
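The comparison of k values can be sketched in plain Python. This is a minimal illustration of KNN with a majority vote and Euclidean distance, not the Brightics implementation. The data points are hypothetical (petal length, petal width) pairs rather than rows from sample_iris.csv, and the function names are our own.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Return the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Hypothetical (petal length, petal width) points, not rows from sample_iris.csv.
train = [
    ((1.4, 0.2), "setosa"), ((1.5, 0.2), "setosa"), ((1.3, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"), ((4.7, 1.4), "versicolor"), ((4.0, 1.3), "versicolor"),
    ((5.8, 2.2), "virginica"), ((6.0, 2.5), "virginica"), ((5.5, 2.0), "virginica"),
]
test = [((1.6, 0.2), "setosa"), ((4.4, 1.4), "versicolor"), ((5.9, 2.1), "virginica")]

# Try several values of k, as done in Brightics by editing the parameter.
for k in (1, 3, 5):
    print(k, accuracy(train, test, k))
```

On this toy data every k gives the same accuracy because the classes are well separated; on the real dataset the accuracy varies with k as reported above.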

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

Based on the classification evaluation results with k = 7, the model performed very well on the sample_iris dataset. Here is the analysis of the metrics:

- Accuracy: The model's accuracy is 1.0, meaning it predicts the correct flower species in 100% of test cases.

- Setosa: F1, Precision, and Recall are all 1.0, so the model recognized every Setosa sample without error.

- Virginica: F1, Precision, and Recall are all 1.0, so the model recognized every Virginica sample without error.

- Versicolor: F1, Precision, and Recall are all 1.0, so the model recognized every Versicolor sample without error.

<b>2. Decision Tree</b>

<b> BRIGHTICS PROCESS</b>

<b>DATA LOAD</b>

- Load the data from “sample_iris.csv”.

- We use the sample data provided by Samsung Brightics AI.

</div><span class="text_page_counter">Trang 11</span><div class="page_container" data-page="11">

We ran Decision Tree classification with max depth = 3, 5, and 7, using species as the outcome variable. With max depth = 3, the model achieved an accuracy of <b>0.967</b>. With max depth = 5 or 7, the model achieved the same, lower accuracy of <b>0.93</b>. Therefore, <b>the best max depth is 3.</b>
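To make the max depth parameter concrete, here is a minimal sketch of a decision tree as nested Python dictionaries. The tree, its thresholds, and the feature names are hypothetical and are not read from the trained Brightics model. This tree has depth 2, so it would fit within a max depth limit of 3.

```python
# Hypothetical tree; thresholds are illustrative, not taken from the Brightics model.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.75,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def depth(node):
    """Number of decision levels between the root and the deepest leaf."""
    if "label" in node:
        return 0
    return 1 + max(depth(node["left"]), depth(node["right"]))

sample = {"petal_length": 4.7, "petal_width": 1.4}
print(predict(tree, sample), depth(tree))  # versicolor 2
```

A larger max depth allows more consecutive splits, which lets the tree fit the training data more closely but can reduce accuracy on the test set.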

<b>Decision Tree Classification Train</b>

</div><span class="text_page_counter">Trang 12</span><div class="page_container" data-page="12">

 Splitter: Best.

 Max Depth: 3 , 5, 7 (Replace the values one by one).

<b>Decision Tree Classification Predict</b>

- <b>Parameter</b>

 Inputs: click ‘Empty’ in the ‘test_table’.

 Set Inputs: drag ‘test_table’ in the Split Data and drop it to the ‘Drop Data’ in the ‘test_table’.

</div><span class="text_page_counter">Trang 13</span><div class="page_container" data-page="13">

 Parameter:

 Label Column: ‘species’.  Prediction Column: ‘prediction’.

</div><span class="text_page_counter">Trang 14</span><div class="page_container" data-page="14">

<b>Comment:</b> With max depth = 3, the predictions were highly accurate (Accuracy: 0.967). The Decision Tree model classified the three flower species as follows: setosa (10/10), versicolor (10/10), virginica (9/10). Setosa and versicolor were classified without error, while only 9 of the 10 Iris virginica records were classified correctly (0.9).

<b>3. Logistic Regression</b>

<b>BRIGHTICS PROCESS</b>

<b>DATA LOAD</b>

- Load the data from “sample_iris.csv”.

- We use the sample data provided by Samsung Brightics AI.

</div><span class="text_page_counter">Trang 15</span><div class="page_container" data-page="15">

<b>PRE-PROCESSING</b>

<b>Query Executor</b>

Convert the dependent variable (species) into a numeric code, spec_cd, according to the species value: 1 for setosa, 2 for virginica, and 0 for versicolor.
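The conversion can be sketched in Python as a simple lookup. The code values follow the labels used later in this report (Setosa = 1, Virginica = 2, Versicolor = 0). The function name `add_spec_cd` is our own, and the actual query run in the Query Executor is not shown in the report.

```python
# Mapping taken from the report: 1 = setosa, 2 = virginica, 0 = versicolor.
SPECIES_TO_CODE = {"setosa": 1, "virginica": 2, "versicolor": 0}

def add_spec_cd(rows):
    """Return copies of the rows with a numeric spec_cd column added."""
    return [{**row, "spec_cd": SPECIES_TO_CODE[row["species"]]} for row in rows]

rows = [{"species": "setosa"}, {"species": "versicolor"}, {"species": "virginica"}]
print([row["spec_cd"] for row in add_spec_cd(rows)])  # [1, 0, 2]
```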

<b>DESCRIPTIVE ANALYSIS</b>

<b>Statistic Summary</b>

 For the numeric variables (sepal length, sepal width, petal length, petal width), examine summary statistics by species.

</div><span class="text_page_counter">Trang 16</span><div class="page_container" data-page="16">

 Parameter:

 Columns: sepal length, sepal width, petal length, petal width.

 Target Statistic: Max, Min, Average, Standard deviation.

 Group by: species.
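A minimal Python sketch of this grouped summary follows. The rows are hypothetical measurements, not data from sample_iris.csv, and the use of the sample standard deviation is an assumption, since the report does not say which variant Brightics computes.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(rows, column, group_key="species"):
    """Return max, min, average, and sample standard deviation per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[column])
    return {
        name: {"max": max(v), "min": min(v), "avg": mean(v), "std": stdev(v)}
        for name, v in groups.items()
    }

# Hypothetical measurements, not rows from sample_iris.csv.
rows = [
    {"species": "setosa", "petal_length": 1.0},
    {"species": "setosa", "petal_length": 2.0},
    {"species": "setosa", "petal_length": 3.0},
    {"species": "virginica", "petal_length": 5.0},
    {"species": "virginica", "petal_length": 7.0},
]
summary = summarize(rows, "petal_length")
print(summary["setosa"])
```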

<b>Select Column</b>

 Convert the variables to String format so that they can be summarized as categorical values.

 Parameter:

 Condition: Change the type of the "sepal length, sepal width, petal length, petal width" variables to String.

</div><span class="text_page_counter">Trang 17</span><div class="page_container" data-page="17">

<b>String Summary</b>

Examine the frequencies and proportions of the categorical variables, grouped by species.

 Parameter:

 Input Columns: sepal length, sepal width, petal length, petal width.

 Group by: species.

</div><span class="text_page_counter">Trang 18</span><div class="page_container" data-page="18">

<b>Logistic Regression Train</b>

Select the dependent variable (spec_cd) and explanatory variables (sepal length, sepal width, petal length, petal width), then proceed with the analysis.

- Parameter:

 Inputs: Split Data-train_table.

 Feature Columns: sepal length, sepal width, petal length, petal width.

 Label Column: spec_cd.

</div><span class="text_page_counter">Trang 19</span><div class="page_container" data-page="19">

<b>Logistic Regression Predict</b>

Perform predictions by applying the regression equation generated in the training step.

 Inputs: Logistic Regression Predict.

 Label Column: spec_cd.

 Prediction Column: prediction.

Based on the classification evaluation results, the logistic regression model performed very well on the sample_iris dataset. Here are some detailed comments:

- <b>Accuracy:</b> The model's accuracy is 0.967, which means it predicts the correct flower species in about 96.7% of cases.

- <b>Species 1 (Setosa):</b> F1, Precision, and Recall are all 1.0. The model identified every Setosa sample correctly and assigned no other species to Setosa.

</div><span class="text_page_counter">Trang 20</span><div class="page_container" data-page="20">

- <b>Species 2 (Virginica):</b> The model performed well for this species too, with an F1 of 0.95, Precision of 1.0, and Recall of 0.9. Every sample predicted as Virginica was correct, but the model missed some actual Virginica samples.

- <b>Species 0 (Versicolor):</b> The model has an F1 of 0.95, Precision of 0.91, and Recall of 1.0. The model found every actual Versicolor sample, but it also labeled a sample of another species as Versicolor.
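The reported metrics are consistent with a single misclassification: one Virginica sample predicted as Versicolor. The sketch below checks this with a confusion matrix reconstructed from the reported figures, assuming 10 test records per species. The matrix is our inference, not a table copied from Brightics.

```python
LABELS = ["setosa", "versicolor", "virginica"]
# Rows = actual species, columns = predicted species (reconstructed, an assumption).
CONFUSION = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 1, 9],
]

def per_class_metrics(matrix, index):
    """Return (precision, recall, F1) for the class at the given index."""
    tp = matrix[index][index]
    fp = sum(row[index] for row in matrix) - tp  # predicted as class, actually other
    fn = sum(matrix[index]) - tp                 # actually class, predicted as other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
print("accuracy", round(correct / total, 3))
for i, name in enumerate(LABELS):
    p, r, f = per_class_metrics(CONFUSION, i)
    print(name, round(p, 2), round(r, 2), round(f, 2))
```

The rounded results (accuracy 0.967; Versicolor 0.91 / 1.0 / 0.95; Virginica 1.0 / 0.9 / 0.95) match the values reported above.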

<b>Evaluation 2</b>

<b>Plot ROC and PR Curves</b>

<b>Setosa_1</b>

Check the performance through the plots of ROC (Receiver Operating Characteristic) and PR (Precision-Recall).

In this case, the target class is spec_cd = 1 (setosa).

Parameter:

 Label Column: spec_cd.

 Probability Column: probability_1.

 Positive Label: 1.

</div><span class="text_page_counter">Trang 21</span><div class="page_container" data-page="21">

The ROC curve chart shows a threshold of 0.69 and an AUC (Area Under the Curve) value of 1.00.

Check the performance through the plots of ROC (Receiver Operating Characteristic) and PR (Precision-Recall).

In this case, the target class is spec_cd = 0 (versicolor).

 Label Column: spec_cd.

 Probability Column: probability_0.

 Positive Label: 0.

</div><span class="text_page_counter">Trang 22</span><div class="page_container" data-page="22">

The ROC curve chart shows a threshold of 0.55 and an AUC (Area Under the Curve) value of 1.00.

Check the performance through the plots of ROC (Receiver Operating Characteristic) and PR (Precision-Recall).

In this case, the target class is spec_cd = 2 (virginica).

 Label Column: spec_cd.

 Probability Column: probability_2.

 Positive Label: 2.

</div><span class="text_page_counter">Trang 23</span><div class="page_container" data-page="23">

The ROC curve chart shows a threshold of 0.46 and an AUC (Area Under the Curve) value of 1.00.
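An AUC of 1.00 means that every sample of the positive class receives a higher probability than every sample of the other classes. A minimal Python sketch of this one-vs-rest AUC follows. The labels and probabilities are hypothetical values in the style of the spec_cd and probability_1 columns, and the function name `roc_auc` is our own.

```python
def roc_auc(labels, scores, positive):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical spec_cd labels and probability_1 scores (1 = setosa).
spec_cd = [1, 1, 1, 0, 0, 2, 2]
probability_1 = [0.97, 0.91, 0.69, 0.30, 0.12, 0.05, 0.01]
print(roc_auc(spec_cd, probability_1, positive=1))  # 1.0
```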

<b>Comment:</b> We obtain good accuracy (96.7%) in iris flower classification using sepal length, sepal width, petal length, and petal width.

</div><span class="text_page_counter">Trang 24</span><div class="page_container" data-page="24">

<b>5. Prediction for the given data</b>

<b>Create table</b>

<b>K-Nearest Neighbors</b>

<b>Comment:</b> Using the provided data, we apply the previously trained KNN model to predict the species. The model predicts <b>Iris Setosa</b>, with a probability of 85.71% for Iris Setosa and 14.29% for Iris Versicolor.
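These two percentages equal 6/7 and 1/7, which is what an unweighted vote among k = 7 neighbours would give if six neighbours were Setosa and one was Versicolor. The sketch below shows the arithmetic. The neighbour labels are hypothetical, and whether Brightics weights the votes is not stated in the report.

```python
from collections import Counter

# Hypothetical labels of the 7 nearest neighbours (k = 7).
neighbour_labels = ["setosa"] * 6 + ["versicolor"]

votes = Counter(neighbour_labels)
shares = {label: round(100 * count / len(neighbour_labels), 2)
          for label, count in votes.items()}
print(shares)  # {'setosa': 85.71, 'versicolor': 14.29}
```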

<b>Decision Tree</b>

</div><span class="text_page_counter">Trang 25</span><div class="page_container" data-page="25">

<b>Comment:</b> Using the provided data, we apply the previously trained Decision Tree model to predict the species. The model predicts <b>Iris Versicolor</b>, with a probability of 0% for Iris Setosa, 97.44% for Iris Versicolor, and 2.56% for Iris Virginica.

<b>Logistic Regression</b>

<b>Comment:</b> Using the provided data, we apply the previously trained Logistic Regression model to predict the species. The model predicts <b>Iris Setosa</b>, with a probability of 89.69% for Iris Setosa, 10.02% for Iris Versicolor, and 0.29% for Iris Virginica.
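The three probabilities sum to 100%, as expected when class scores are normalized into probabilities. A minimal sketch of that normalization with the softmax function follows, assuming a multinomial formulation; Brightics may instead use another scheme such as one-vs-rest. The scores are hypothetical and are not the fitted model's values.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for spec_cd = 0, 1, 2 (not the fitted model's values).
scores = [1.0, 3.2, -2.5]
probabilities = softmax(scores)

# The predicted class is the index with the highest probability.
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
print(predicted, [round(p, 4) for p in probabilities])
```

Here the predicted index is 1, which corresponds to Setosa under the spec_cd coding used in this report.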

<b>Executive Summary</b>

Our team analyzed the data and classified it into three categories: Setosa, Versicolor, and Virginica. Three machine learning techniques were used for classification: K-Nearest Neighbors (KNN), Decision Tree, and Logistic Regression. The three algorithms achieved accuracies of 100%, 96.7%, and 96.7%, respectively.

To conduct the analysis, our team performed the following steps:

<b>1. Data Collection:</b> We obtained the sample data from Samsung Brightics AI.

</div><span class="text_page_counter">Trang 26</span><div class="page_container" data-page="26">

<b>2. Cleaning the Data:</b> We cleaned the data by eliminating duplicates, missing values, and outliers.

<b>3. Data Preprocessing:</b> We preprocessed the data by dividing it into training and testing sets.

<b>4. Model Training:</b> On the training set, we trained three machine learning models: K-Nearest Neighbors (KNN) with k = 7, Decision Tree with max depth = 3, and Logistic Regression.

<b>5. Model Evaluation:</b> We evaluated the models on the testing set and obtained accuracies of 100%, 96.7%, and 96.7%, respectively.

<b>6. Classification:</b> We classified the data into three groups using the trained models: Setosa, Versicolor, and Virginica.

<b>Result: On the testing set (Accuracy)</b>

<b>On the given data</b>

<b>K-Nearest Neighbors: </b>The model predicts that this is <b>Iris Setosa</b>.

<b>Decision Tree: </b>The model predicts that this is <b>Iris Versicolor</b>.

<b>Logistic Regression: </b>The model predicts that this is <b>Iris Setosa</b>.


</div>
