unit 14 business intelligence assignment 1


<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">


<b>Higher Nationals in Computing</b>

<b>Unit 14: Business Intelligence ASSIGNMENT 1 </b>

Learner’s name: Nguyen Xuan Nam ID: GCS200708

Class: GCS0905A Subject ID: 1641

<b>Assessor name: Nguyen Xuan Sam</b>

Assignment due: 4/3/2023

Assignment submitted: 4/3/2023

</div><span class="text_page_counter">Trang 2</span><div class="page_container" data-page="2">


<b> ASSIGNMENT 1 FRONT SHEET </b>

<b>Qualification BTEC Level 5 HND Diploma in Computing Unit number and title Unit 14: Business Intelligence</b>

</div><span class="text_page_counter">Trang 3</span><div class="page_container" data-page="3">


<b>Summative Feedback: </b>

<b> </b>

<b>Resubmission Feedback:</b>

<b>Signature & Date:</b>

</div><span class="text_page_counter">Trang 4</span><div class="page_container" data-page="4">


<b>ASSIGNMENT 1 BRIEF </b>

<b>Qualification BTEC Level 5 HND Diploma in Computing Unit number and title </b> Unit 14: Business Intelligence

<b>Assignment title </b> Assignment 1: Discover business process and BI technologies

<b>Academic Year </b> 2023

<b>Unit Tutor </b> Nguyen Xuan Sam

<b>Submission Format: </b>

Format: The submission is in the form of an individual written report that shows how you have managed the project. This should be written in a concise, formal business style using single spacing and font size 12. You are required to make use of headings, paragraphs and subsections as appropriate, and all work must be supported with research and referenced using the Harvard referencing system. Please also provide a bibliography using the Harvard referencing system.

Submission: Students are required to submit the assignment by the due date and in the way requested by the Tutors. The form of submission will be a soft copy in PDF posted on the corresponding course of The Assignment must be your own work, and not copied by or from another student or from books etc. If you use ideas, quotes or data (such as diagrams) from books, journals or other sources, you must reference your sources, using the Harvard style. Make sure that you know how to reference properly, and that you understand the guidelines on plagiarism. If you do not, you will definitely fail.

<b>Assignment Brief and Guidance: </b>

Your company has been operating in [Assumed Domain] for 2 years. For a new, young company, the competition in the market is very high. Therefore, the Board of Directors has decided to apply Business Intelligence to improve the company's business processes by making better decisions. The Board of Directors assigns a small group, including you, in the Research & Development Department

</div><span class="text_page_counter">Trang 5</span><div class="page_container" data-page="5">

to study business intelligence to apply for the company in the coming years.

You need to research the business processes and decision support processes in the company and identify the types of data (unstructured, semi-structured or structured) generated by these processes, with examples. You also need to research the software currently used in the business processes or decision support processes and evaluate its use (benefits and drawbacks). Next, you need to understand the types of support for decision-making at different levels (operational, tactical and strategic) within the company and study which business intelligence features can help with those types of support. Study the information systems or BI technologies that can be used in this case, then compare and contrast them to conclude which should be used.

Your group needs to present the research results to the board in a presentation of 30 minutes.

<b>Learning Outcomes and Assessment Criteria </b>

<b>LO1 Discuss business processes and the mechanisms used to support business decision-making </b>

<b>D1</b> Critically evaluate the project management process and appropriate research methodologies applied.

<b>P1</b> Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’.

<b>M1</b> Differentiate between unstructured and semi-structured data within an organisation.

<b>LO2 Compare the tools and technologies associated with business intelligence functionality </b>

<b>D2</b> Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels.

<b>P2</b> Compare the types of support available for business decision-making at varying levels within an organisation.

<b>M2</b> Justify, with specific examples, the key features of business intelligence functionality.

</div><span class="text_page_counter">Trang 6</span><div class="page_container" data-page="6">

<b>4.3 Scenarios and analysis ... 20</b>

<b>5 Conclusions and future works ... 24</b>

<b>5.1 Conclusions ... 24</b>

<b>5.2 Future works ... 24</b>

</div><span class="text_page_counter">Trang 7</span><div class="page_container" data-page="7">

<small>Figure 1: The factors that impact house prices. </small>

Nowadays, data scientists have built many machine learning projects for price prediction. With machine learning, we can predict the value for a new data point based on features that we already have. One of the most widely used models for predictive analysis is regression. Its purpose is to predict future results, and it has been applied in many fields such as economics, business, banking, healthcare, e-commerce, entertainment, and sports. This technique is therefore commonly used to build models that predict prices from a set of features.

1.2 Motivations

Real estate is one of the least transparent sectors of our economy. Real estate prices fluctuate daily, and sometimes prices are inflated rather than based on a proper valuation. When people decide to buy a home, they look for one that is affordable and meets all their requirements. With machine learning, we can easily predict house prices and

</div><span class="text_page_counter">Trang 8</span><div class="page_container" data-page="8">

decide whether the house is worth buying or selling for a higher price. In this report, we will forecast home prices in King County, USA. Features such as the size, location, and square footage of the home can be key factors in determining the price.

1.3 Objectives

In this work, I focus on several goals:

- How does the size of the house affect the house price?
- How does the size of the living area affect the house price?
- How does the area of the house grounds affect the area of the house?
- How does the area of the house affect the number of bathrooms?
- Multiple regression on all features

To answer the questions in the first chapter, I will show the dataset. There are several steps to get information from raw data. These steps are shown in Figure 1 below, namely data collection

In the first chapter, I introduced my work and outlined the goals of the project. The rest of this work includes showcasing my dataset, methods, and results, as well as a demo of the application.

<b>2 Related works and dataset </b>

2.1 Related works

In one study, the authors used several algorithms: Multiple Linear Regression, Ridge Regression, LASSO Regression, Elastic Net Regression, Ada Boosting Regression, and Gradient Boosting. The purpose of the study was to compare these methods by their model error. The results show that multiple regression has a fairly low error, which suggests that it is one of the suitable models for predicting housing prices.

In a further study, the authors divide the characteristics affecting housing prices into three categories: physical (structural) conditions, concepts, and locations. Physical features are those characteristics of the house that can be seen with the human eye, such as the size of the house, the number of bedrooms, the presence of a kitchen and garage, the presence of a garden, the size of the plot and other structures, and the age of the house. Conceptual features, on the other hand, are concepts offered by developers to attract buyers, such as a minimalist home, a healthy and environmentally friendly home, or an elite environment. The study found that these characteristics are significantly correlated with real estate prices.

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

In summary, there are many studies on predicting house prices using different machine learning methods or models. In my project, I will use linear regression and multiple regression for model building and forecasting. I will take advantage of all the features in this dataset with the aim of building a good model.

2.2 Dataset

<b>2.2.1 Data collection </b>

I got the data from Kaggle. The dataset contains house sale prices for 2014-2015. The raw dataset contains over 21,000 entries and 21 columns. In this dataset, the price column is the dependent variable, and the remaining columns, except ID and Date, are the independent variables. The beginning of the dataset is shown in Figure 3; the price is a continuous dependent variable, and the other variables are the independent variables.
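The split into a dependent variable and independent variables can be sketched as follows. This is a minimal sketch, not the code used in the report: the file name `kc_house_data.csv` mentioned in the comment, the helper `load_houses`, and the three illustrative rows are assumptions, and an in-memory sample is used so that the example runs without the Kaggle download.

```python
import io

import pandas as pd

# Illustrative rows in the assumed layout of the Kaggle file. With the real
# download, pass its path instead (for example "kc_house_data.csv", a file
# name that is an assumption here).
SAMPLE_CSV = """id,date,price,bedrooms,bathrooms,sqft_living
7129300520,20141013T000000,221900,3,1.0,1180
6414100192,20141209T000000,538000,3,2.25,2570
5631500400,20150225T000000,180000,2,1.0,770
"""


def load_houses(source):
    """Read the CSV and split it into features X and target y (price)."""
    df = pd.read_csv(source)
    y = df["price"]                               # dependent variable
    X = df.drop(columns=["id", "date", "price"])  # independent variables
    return df, X, y


df, X, y = load_houses(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(list(X.columns))
```

Dropping `id` and `date` mirrors the statement that every column except ID and Date is treated as an independent variable.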

<b>2.2.2 Dataset description</b>

The dataset includes:

- Id: the unique identifier of each house
- Date: the date when the house was sold
- Price: the sale price of the house (in US dollars)
- Bedrooms: the number of bedrooms
- Bathrooms: the number of bathrooms
- Sqft_living: the square footage of the house
- Sqft_lot: the square footage of the lot
- Floors: the number of floors
- Waterfront: whether the house has a waterfront view
- View: how good the view from the house is
- Condition: the overall condition of the house on a scale of 1-5
- Grade: the overall grade of the housing unit on a scale of 1-13
- Sqft_above: the square footage of the home, excluding the basement
- Sqft_basement: the square footage of the basement
- Yr_built: the year the house was built
- Yr_renovated: the year the house was renovated

</div><span class="text_page_counter">Trang 10</span><div class="page_container" data-page="10">

- Zipcode: the zip code of the house
- Lat: latitude coordinate
- Long: longitude coordinate
- Sqft_living15: the interior living area of the 15 nearest neighbors
- Sqft_lot15: the lot area of the 15 nearest neighbors

<b>2.2.3 Data cleaning </b>

Data cleaning is one of the most important steps before discovery, analysis, and modeling in machine learning. The purpose of data cleaning is to deal with abnormal data such as missing data, outliers, unwanted data, or inconsistent data. There are many ways to clean data, for example deleting records, replacing values, or changing the datatype of a column. Before cleaning, first examine the raw dataset to see what to do next:
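A first examination of this kind can be sketched with pandas. The helper `inspect` and the small `toy` frame are hypothetical and only illustrate the checks named above (missing values, duplicates, datatypes); they are not the report's actual output.

```python
import pandas as pd


def inspect(df):
    """Summarize common data-quality problems in a DataFrame."""
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Hypothetical toy data: one missing price and one fully duplicated row.
toy = pd.DataFrame({
    "price": [221900.0, 538000.0, float("nan"), 538000.0],
    "bedrooms": [3, 3, 2, 3],
})

report = inspect(toy)
for key, value in report.items():
    print(key, value)
```

On the real dataset, the same summary indicates whether rows need to be deleted, values replaced, or datatypes changed.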

<b>2.2.4 Data processing </b>

For data processing, there are several things to do with the raw dataset:

a. Change the datatype of sqft_basement from int to float

</div><span class="text_page_counter">Trang 11</span><div class="page_container" data-page="11">

b. Change the datatype of the date column and create more columns from it

I change the date column from the object datatype to a date datatype.

c. Change the datatype of yr_renovated

In the raw dataset, this column has the datatype int, so I change it to float.
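Steps a to c can be sketched in pandas as below. The raw date format (`20141013T000000`), the names of the derived columns (`year`, `month`), and the helper `process` are assumptions made for illustration, since the report does not show them.

```python
import pandas as pd

# Hypothetical two-row sample; the date string format is an assumption.
raw = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "sqft_basement": [0, 400],
    "yr_renovated": [0, 1991],
})


def process(df):
    """Apply the datatype changes a, b and c to a copy of the data."""
    out = df.copy()
    # a. sqft_basement: int -> float
    out["sqft_basement"] = out["sqft_basement"].astype(float)
    # b. date: object -> datetime, plus derived columns (names assumed)
    out["date"] = pd.to_datetime(out["date"], format="%Y%m%dT%H%M%S")
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    # c. yr_renovated: int -> float
    out["yr_renovated"] = out["yr_renovated"].astype(float)
    return out


clean = process(raw)
print(clean.dtypes)
```

Working on a copy keeps the raw frame unchanged, so the conversion can be repeated safely while exploring.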

2.3 Summary

In this chapter, I cleaned my raw dataset into a better dataset that is easy to explore and analyze. I believe this is the most important step before doing any prediction or modeling in machine learning. In the next chapter, I begin to build and explain my model, and some visualizations will be shown for better analysis.

</div><span class="text_page_counter">Trang 12</span><div class="page_container" data-page="12">

The correlation coefficient r can range from +1.0 (a perfect positive correlation) to -1.0 (a perfect negative correlation). If r = 0, there is no linear correlation between the x and y variables. If all points on the scatter plot fall on a straight line, the correlation is perfect. The stronger the linear relationship between the two variables, the further the correlation is from 0.0. The sign of the correlation coefficient reflects the direction of the relationship (David Groebner, 2017).
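To make the range of r concrete, the sample (Pearson) correlation coefficient can be computed directly from its definition. This is a minimal sketch in plain Python; the function name `pearson_r` and the toy values are assumptions used only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Points on a rising line give r = +1; points on a falling line give r = -1.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])
print(r_up, r_down)
```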

</div><span class="text_page_counter">Trang 13</span><div class="page_container" data-page="13">

3.2 Linear regression

The basic equation of simple linear regression is given in (David Groebner, 2017). In Equation 1, y is the dependent variable (the result) and x is the independent variable; the relationship is shown as follows:
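The equation itself is not reproduced in this copy of the report. The usual textbook form, assumed here, is y = b0 + b1·x + e, where b0 is the intercept, b1 is the slope, and e is the error term. A minimal least-squares sketch in plain Python (the function name `fit_line` and the toy data are hypothetical):

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


# Toy data generated from y = 1 + 2x, so the fit should recover b0=1, b1=2.
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```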

3.3 Multiple regression

In this project, I use multiple regression to predict the house price from several features of the house at once. In multiple regression, this is the formula (David Groebner, 2017):
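The formula is not reproduced in this copy of the report. The usual form, assumed here, is y = b0 + b1·x1 + b2·x2 + ... + bk·xk + e, with one coefficient per independent variable. The sketch below estimates the coefficients by least squares with NumPy; the helpers `fit_multiple` and `predict` and the toy data are hypothetical, and the report itself names Scikit-learn and Statsmodels for modeling.

```python
import numpy as np


def fit_multiple(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + X @ b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Predicted values for the rows of X using fitted coefficients."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]


# Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X_toy = [[1, 0], [0, 1], [1, 1], [2, 1]]
y_toy = [3, 4, 6, 8]
coef = fit_multiple(X_toy, y_toy)
print(coef)
print(predict(coef, [[3, 2]]))
```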

</div><span class="text_page_counter">Trang 14</span><div class="page_container" data-page="14">


<b>3.4 R-squared and Adjusted R-squared</b>

The coefficient of determination R² and the adjusted R², which indicate how much of the variation in the response is explained by the model, may be the most frequently used statistics in regression to assess the goodness of fit of a model (Akossou and Palm, 2013).

</div><span class="text_page_counter">Trang 15</span><div class="page_container" data-page="15">

Adjusted R-squared measures the percentage of variance explained by only those independent variables that have a significant impact on the dependent variable. Unlike R-squared, which never decreases when another variable is added, adjusted R-squared increases only if the added independent variable improves the model more than would be expected by chance.
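Both statistics can be computed from their standard definitions: R² = 1 - SS_res / SS_tot, and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables. The function names and toy values below are assumptions for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of independent variables used."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical actual and predicted values from a one-variable model.
actual = [1, 2, 3, 4, 5]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_features=1)
print(r2, adj_r2)
```

Because of the (n - 1) / (n - k - 1) factor, the adjusted value is always at or below R² and drops when a variable adds little explanatory power.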

<b>3.5 Model accuracy </b>

The average absolute error between the actual value and the predicted value is called MAE (Mean Absolute Error). Absolute error, also known as L1 loss, is a row-level error calculation that determines the non-negative

</div><span class="text_page_counter">Trang 16</span><div class="page_container" data-page="16">

difference between prediction and reality. We can better evaluate the model's performance on the entire data set by checking the MAE, which is the average of these errors.

The mean squared difference between the actual value and the predicted value is called the MSE (Mean Squared Error). The squared difference between the predicted and the actual value, computed at row level, is called the squared error, sometimes referred to as the L2 loss. We can better evaluate the model's performance on the entire data set by checking the MSE, which is the average of these errors.

</div><span class="text_page_counter">Trang 17</span><div class="page_container" data-page="17">

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure the distance between the data points and the regression line, and RMSE measures how spread out these residuals are.
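The three error measures can be written in a few lines of plain Python. This is a minimal sketch with hypothetical function names and toy values; it follows the definitions above, with MAE and MSE as averages of row-level errors and RMSE as the square root of MSE.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical actual and predicted values; row-level errors are 1, 0 and 3.
actual = [3, 5, 7]
predicted = [2, 5, 10]

print(mae(actual, predicted))
print(mse(actual, predicted))
print(rmse(actual, predicted))
```

Squaring gives the large error of 3 more weight in MSE and RMSE than it has in MAE, which is why the measures can rank models differently.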

<b>4 Simulating scenarios and Results </b>

4.1 Package installation

<b>Step 1: Install the basic packages for this job </b>

There are three packages that need to be installed for data discovery, analysis, and application: Pandas, Numpy and Streamlit. We can install using pip or anaconda:

Using pip:

Using Anaconda:

Pandas version: 1.5.3

Numpy version: 1.24

Streamlit version: 1.11.1

<b>Step 2: Install packages for data visualization </b>

There are two packages I use for visualization: seaborn and matplotlib.

Using pip:

Using Anaconda:

</div><span class="text_page_counter">Trang 18</span><div class="page_container" data-page="18">

Matplotlib version: 3.7.1

Seaborn version: 0.12.2

<b>Step 3: Install packages for modeling </b>

I am using the Scikit-learn package for modeling, and this package requires:

- Python (>= 3.8)
- NumPy (>= 1.24)
- SciPy (>= 1.4.1)
- joblib (>= 0.11)

Install Scikit-learn using pip:

Using Anaconda:

Scikit-learn version: 1.1.0

The second package is Statsmodels.

<b>Step 4: Install package for map </b>

I am using the Folium package to show the map:

Using pip:

Using Anaconda:

Folium version: 0.14.0

After installing all of the required packages for this work, I will import all of them in Jupyter Notebook, except Streamlit package:
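Before importing, it can help to confirm which of the packages named in Steps 1 to 4 are actually installed and in which version. The sketch below uses only the Python standard library; the helper `installed_versions` is hypothetical, and the distribution names in `PACKAGES` are assumed to match the names used with pip.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip (assumed to match Steps 1 to 4).
PACKAGES = [
    "pandas", "numpy", "streamlit", "matplotlib",
    "seaborn", "scikit-learn", "statsmodels", "folium",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output with the versions listed above shows whether the environment matches the one used for the report.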

</div><span class="text_page_counter">Trang 19</span><div class="page_container" data-page="19">

4.2 Correlation

I will explore the correlations in the dataset. I use a heatmap to show this:

</div><span class="text_page_counter">Trang 20</span><div class="page_container" data-page="20">

As we can see from the heatmap, I select the pairs with a high correlation, that is, a correlation score above 0.5:

- price and sqft_above: 0.605567
- price and sqft_living: 0.702035
- bathrooms and sqft_living: 0.754665
- sqft_above and sqft_living: 0.876597
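Pairs like these can also be extracted programmatically from the correlation matrix instead of being read off the heatmap. The sketch below is an assumption about how that could be done with pandas: the helper `strong_pairs` and the toy frame are hypothetical, and the absolute value of r is compared with the threshold so that strong negative correlations would be reported as well.

```python
import pandas as pd


def strong_pairs(df, threshold=0.5):
    """Return (col_a, col_b, r) for column pairs with |r| above threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so no duplicate pairs
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical toy data: price tracks sqft_living closely, noise does not.
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "sqft_living": [10, 20, 30, 41],
    "noise": [1, -1, -1, 1],
})

pairs = strong_pairs(toy)
for a, b, r in pairs:
    print(f"{a} and {b}: {r:.6f}")
```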