title: Data Analysis Case Studies
authors: Rajagopalan, Gayathri
date: 2020-12-23
journal: A Python Data Analyst's Toolkit
DOI: 10.1007/978-1-4842-6399-0_8

In the last chapter, we looked at the various Python-based visualization libraries and how the functions from these libraries can be used to plot different graphs. Now, we aim to understand the practical applications of the concepts we have discussed so far with the help of case studies.

Technical requirements

The following datasets are used in this chapter's case studies:
• Highest grossing movies in France: Data about the top films by revenue in France in the year 2018, taken directly from a Wikipedia page
• Air quality management: Daily concentrations of common air pollutants recorded at a monitoring station in Delhi
• COVID-19 trend analysis: Dataset capturing the number of cases and deaths for various countries across the world daily for the first six months of the year 2020

For the first case study, you need to refer to the following Wikipedia URL (data is taken directly from the web page):
https://en.wikipedia.org/wiki/List_of_2018_box_office_number-one_films_in_France
For the second case study, download a CSV file from the following link:
https://github.com/DataRepo2019/Data-files/blob/master/NSIT%20Dwarka.csv
For the third case study, download an Excel file from the following link:
https://github.com/DataRepo2019/Data-files/blob/master/COVID-19-geographicdisbtribution-worldwide-2020-06-29.xlsx

In addition to the modules and libraries we used in the previous chapters (including Pandas, NumPy, Matplotlib, and Seaborn), we use the requests module in this chapter to make HTTP requests to websites. To use the functions contained in this module, import it in your Jupyter notebook using the following line:

import requests

If the requests module is not installed, you can install it with the following command at the Anaconda Prompt:

pip install requests

We will be using the following methodology for each of the case studies:
1. Open a new Jupyter notebook, and perform the following steps:
• Import the libraries and data necessary for your analysis
• Read the dataset and examine the first five rows (using the head method)
• Get information about the data type of each column and the number of non-null values in each column (using the info method) and the dimensions of the dataset (using the shape attribute)
• Get summary statistics for each column (using the describe method) and obtain the values of the count, min, max, standard deviation, and percentiles

Case study 8-1: Highest grossing movies in France - analyzing unstructured data

In this case study, the data is read from an HTML page instead of a conventional CSV file. The URL that we are going to use is the following: https://en.wikipedia.org/wiki/List_of_2018_box_office_number-one_films_in_France
This web page has a table that displays data about the number-one films at the French box office in the year 2018, along with their gross revenue. We import this data into Pandas using methods from the Requests library. Requests is a Python library used for making HTTP requests. It helps with extracting HTML from web pages and interfacing with APIs.
Questions that we want to answer through our analysis:
1. Identify the top five films by revenue
2. What is the percentage share (revenue) of each of the top ten movies?
3. How did the monthly average revenue change during the year?

Step 1: Importing the data and examining the characteristics of the dataset
First, import the libraries and use the necessary functions to retrieve the data.
CODE:

#importing the libraries
import requests
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Importing the data from the webpage into a DataFrame
url='https://en.wikipedia.org/wiki/List_of_2018_box_office_number-one_films_in_France'
req=requests.get(url)
data=pd.read_html(req.text)

We import all the libraries and store the URL in a variable. Then we make an HTTP request to this URL using the get method to retrieve information from this web page. The text attribute of the requests object contains the HTML data, which is passed to the pd.read_html function. This function returns a list of DataFrame objects containing the various tables on the web page. Since there is only one table on this web page, the list contains a single DataFrame, which we store in the variable df:

df=data[0]
df.info()

Calling the info method shows that four of the five columns, including "Gross", have the object data type (memory usage: 2.0+ KB). As we can see, the data types of the columns are not in the format we need. The "Gross" column represents the gross revenue, which should be a numeric column. This column, however, has been assigned an object data type because it contains numeric as well as non-numeric data (characters like commas, the "$" symbol, and letters like "U" and "S"). In the next step, we deal with problems such as these.

Step 2: Data wrangling
In this step, we clean the "Gross" column so that it can be treated as a numeric column, and we extract the month from the "Date" column. First, we remove the non-numeric characters:

#removing the non-numeric characters from the Gross column
df['Gross']=df['Gross'].str.replace(r"US\$","",regex=True).str.replace(r",","",regex=True)

In the preceding statement, we use a series of chained replace methods and regular expressions to remove the non-numeric characters. The first replace method removes "US$" and the second replace method removes the commas. Replacing a character with an empty string ("") is equivalent to removing the character.
Now, let us use the astype method to typecast or change the data type of this column to int64 so that this column can be used for computations and visualizations:

#changing the data type of the Gross column to make the column numeric
df['Gross']=df['Gross'].astype('int64')

To check whether these changes have been reflected, we examine the first few records of this column and verify the data type:

df['Gross'].head(5)

Output:
0     6557062
1     2127871
2     2006033
3     2771269
4    16604101
Name: Gross, dtype: int64

As we can see from the output, the data type of this column has been changed, and the values no longer contain strings. We also need to extract the month part of the date, which we do by first changing the data type of the "Date" column and then applying the DatetimeIndex method to it, as shown in the following.

#changing the data type of the Date column to extract its components
df['Date']=pd.to_datetime(df['Date'])
#extracting the month into a new column (the column name 'Month' is assumed here)
df['Month']=pd.DatetimeIndex(df['Date']).month

Step 3: Visualization
To visualize our data, we first create another DataFrame (df1) that contains a subset of the columns of the original DataFrame (df). This DataFrame, df1, contains just two columns - "Film" (the name of the movie) and "Gross" (the gross revenue). Then, we sort the values of the revenue in descending order:

df1=df[['Film','Gross']].sort_values(ascending=False,by='Gross')

There is an unwanted column ("index") that gets added to this DataFrame, which we remove in the next step.
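A minimal sketch of this removal step (assuming, as is likely, that sorting leaves the old row positions as the index, which reset_index then exposes as a column named 'index'):

#resetting the row index after sorting; the old index appears as a column named 'index' (sketch, not the book's exact code)
df1=df1.reset_index()
#dropping the unwanted 'index' column
df1=df1.drop(['index'],axis=1)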
The first plot we create is a bar graph showing the top five films in terms of revenue (Figure 8-1).

CODE:

#Plotting the top 5 films by revenue
#setting the figure size
plt.figure(figsize=(10,5))
#creating a bar plot
ax=sns.barplot(x='Film',y='Gross',data=df1.head(5))
#rotating the x axis labels
ax.set_xticklabels(labels=df1.head()['Film'],rotation=75)
#setting the title
ax.set_title("Top 5 Films per revenue")
#setting the Y-axis label
ax.set_ylabel("Gross revenue")
#Labelling the bars in the bar graph
for p in ax.patches:
    ax.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()),ha='center',va='bottom')

Output:

Step 4: Drawing inferences based on analysis and visualizations
1. The average monthly revenue shows wide variation, possibly dependent on the month of release of the movies, which may necessitate further analysis across the years.
2. The top three highest-grossing movies in France in the year 2018 were Avengers, La Ch'tite Famille, and Les Tuche 3.

Case study 8-2: Use of data analysis for air quality management

To monitor the status of ambient air quality, the Central Pollution Control Board (CPCB), India, operates a vast network of monitoring stations spread across the country. Parameters regularly monitored include sulfur dioxide (SO2), oxides of nitrogen measured as nitrogen dioxide (NO2), ozone, and fine particulate matter (PM2.5). Based on trends over the years, air quality in the national capital of Delhi has emerged as a matter of public concern. A stepwise analysis of daily air quality data follows, to demonstrate how data analysis could assist in planning interventions as part of air quality management.
Note: The name of the dataset used for this case study is "NSIT Dwarka.csv". Please refer to the technical requirements section for details on how to import this dataset.
Questions that we want to answer through our analysis:
1. Yearly averages: Out of the four pollutants - SO2, NO2, ozone, and PM2.5 - which have yearly average levels that regularly surpass the prescribed annual standards?
2. Daily standards: For the pollutants of concern, on how many days in each year are the daily standards exceeded?
3. Temporal variation: Which are the months in which pollution levels exceed critical levels on the most days?

Step 1: Importing the data and examining the characteristics of the dataset
CODE:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#aqdata is the name of the DataFrame, short for Air Quality Data
aqdata=pd.read_csv('NSIT Dwarka.csv')
aqdata.head()

Checking the data types of the columns:

aqdata.info()

Output (truncated):
... 2190 non-null object
dtypes: object(6)
memory usage: 102.8+ KB

Observation: Even though the values for SO2, NO2, ozone, and PM2.5 are numeric, Pandas reads the data type of these columns as "object". To work with these columns (i.e., plot graphs, observe trends, calculate aggregate values), we need to change their data types. Also, there seem to be some missing entries.

Step 2: Data wrangling
Based on the observations in the previous step, in this step we will:
1. Deal with missing values: we have the option of either dropping the null values or substituting them.
2. Change the data types of the columns.
First, we count the null values in each column:

aqdata.isna().sum()

We see that the columns containing numeric values are not recognized as numeric columns, and the columns containing dates are also not recognized correctly.
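As a minimal sketch of how the pollutant readings could be converted to numeric values (the column names 'SO2', 'NO2', 'Ozone', and 'PM2.5' are assumptions about how the CSV headers are labeled), one option is pd.to_numeric with errors='coerce', followed by dropping the rows in which no reading could be parsed:

#converting the pollutant columns to numeric values; entries that cannot be parsed become NaN
#(the column names below are assumed; adjust them to match the headers in the CSV file)
pollutants=['SO2','NO2','Ozone','PM2.5']
for col in pollutants:
    aqdata[col]=pd.to_numeric(aqdata[col],errors='coerce')
#dropping rows where none of the pollutant readings could be parsed
aqdata=aqdata.dropna(subset=pollutants,how='all')

Whether to drop or substitute the remaining nulls depends on how many there are; the output of isna().sum() above is the basis for that decision.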
Having columns with incorrect data types is an impediment to the next step, where we analyze trends and plot graphs; this step requires the data types of the columns to be in a format that Pandas can work with. In the following lines of code, we use the pd.to_datetime method to convert the data type of the "From Date" and "To Date" columns to the datetime type, which makes it easier to analyze individual components of the date, like months and years.

CODE:

aqdata['From Date']=pd.to_datetime(aqdata['From Date'])
aqdata['To Date']=pd.to_datetime(aqdata['To Date'])

Use the info method to check whether the data types have been changed:

aqdata.info()

Step 3: Data visualization
Based on the monitored 24-hourly average ambient air concentrations of PM2.5, SO2, NO2, and ozone (O3), yearly averages are plotted to identify the parameters for which the prescribed national ambient air quality standards for the annual average are exceeded.
First, we calculate the yearly averages for each pollutant. The averages are computed with the mean method and rounded to two decimal places with the round function. The notation for naming the variables representing these averages is as follows: the first letter of the pollutant, the two-digit year, and the abbreviation "avg" for average. For instance, s15avg denotes the average level of SO2 in the year 2015. We do not consider ozone, since yearly standards do not apply to ozone.
Next, we create a DataFrame for each pollutant with two columns: one column represents the year, and the other shows the yearly average level for that year. These DataFrames are then plotted as bar graphs, one subplot per pollutant:

CODE:

#Creating one subplot per pollutant and plotting the yearly averages for NO2
#(the three lines below are assumed; dfn, dfp, and dfs denote the per-pollutant DataFrames for NO2, PM2.5, and SO2)
fig,(ax1,ax2,ax3)=plt.subplots(3,1)
dfn.plot(kind='bar',figsize=(20,5),ax=ax1)
ax1.set_title("NO2", fontsize=18)
#Setting the X-axis label for the NO2 graph
ax1.set_xlabel("Years", fontsize=18)
ax1.legend().set_visible(False)
#Setting the Y-axis label
ax1.set_ylabel("Yearly average", fontsize=18)
#Creating a dashed line to indicate the annual standard
ax1.hlines(40, -.9,15, linestyles="dashed")
#Labelling this dashed line
ax1.annotate('Annual avg. standard for NO2',(-0.5,38))
#labelling the bars
for p in ax1.patches:
    ax1.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()), color="black", ha="left", va='bottom', fontsize=12)
#Plotting the yearly averages similarly for PM2.5
dfp.plot(kind='bar',figsize=(20,5),ax=ax2)
ax2.set_title("PM2.5", fontsize=18)
ax2.hlines(40, -.9,15, linestyles="dashed")
ax2.annotate('Annual avg. standard for PM2.5',(-0.5,48))
ax2.legend().set_visible(False)
for p in ax2.patches:
    ax2.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()), color="black", ha="center", va='bottom', fontsize=12)
#Plotting the yearly averages similarly for SO2
dfs.plot(kind='bar',figsize=(20,5),ax=ax3)
ax3.hlines(50, -.9,15, linestyles="dashed")
ax3.annotate('Annual avg. standard for SO2',(-0.5,48))
ax3.set_title("SO2", fontsize=18)
ax3.legend().set_visible(False)
for p in ax3.patches:
    ax3.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()), color="black", ha="center", va='bottom', fontsize=12)

Observation: It is evident that the standards for the annual average are exceeded only for PM2.5. For NO2, the observed values are relatively close to the prescribed standard. For SO2, the observed values are much lower than the annual standard. Therefore, for further analysis, only these two pollutants (NO2 and PM2.5) are considered.
While step 1 of the analysis indicates the pollutants of concern for air quality management and planning of interventions, in step 2, for each year, we show how various levels of exceedance above the standards for 24-hourly values are distributed.
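As a minimal sketch of how such exceedance counts can be computed (assuming the 'From Date' column has been converted to datetime as above and the pollutant columns are numeric; the 'Year' helper column is an assumption, and the 180 μg/m3 PM2.5 level is the one discussed below):

#extracting the year from the 'From Date' column
aqdata['Year']=aqdata['From Date'].dt.year
#counting, for each year, the number of days on which the 24-hourly PM2.5 value exceeds 180
critical_days=aqdata[aqdata['PM2.5']>180].groupby('Year')['PM2.5'].count()
print(critical_days)

The same pattern, with a different threshold, applies to the other exceedance ranges considered in this step.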
In the case of PM2.5, we plot the number of days in each year for which the observed values fall in the following ranges.
Observation: It is seen that PM2.5 values above 180 μg/m3 are observed every year, and therefore, to start with, restrictions on major polluting activities, including traffic, could be confined to this category.
Likewise, for NO2, the number of days in each year on which the monitored values exceed the 24-hourly standard of 80 μg/m3 is plotted (Figure 8-6). First, we create a DataFrame for NO2 that captures the number of days in each year with values higher than 80 μg/m3. Since the observed 24-hourly NO2 values exceed the standard only marginally, and that too only on a few days, the next step is confined to further analysis of PM2.5.
We then build a table (selectionc) with one column per year ('2016', '2017', '2018', and '2019') showing, for each month, the number of days with PM2.5 above 180 μg/m3:

selectionc

Output:
We can observe from this table that month 1 (January), month 11 (November), and month 12 (December) are the most critical months in all four years, as these months had the highest number of days with PM2.5 > 180 μg/m3.
Now that we have all the data in place, let us visualize the critical days for PM2.5 (Figure 8-7). From this graph, it is observed that most of the critically polluted days fall in January, November, and December. Therefore, based on the daily average concentrations of PM2.5 recorded over the past four years, restrictions on vehicular traffic, construction activities, use of diesel pump sets, diversion of traffic entering Delhi from neighboring states, and other similar activities are likely to be imposed in January, November, and December. To make such decisions for Delhi as a whole, analysis of data from other monitoring stations would also be necessary. Dissemination of data and analysis along the preceding lines would help people prepare in advance for restrictions and also appreciate the rationale behind such measures.
The approach demonstrated in the preceding analysis, using data analysis as a tool to assist in air quality management, uses the data recorded at one monitoring station, located at the Netaji Subhas Institute of Technology (NSIT), Delhi. The methodology could be carried forward along the following lines:
a. Repeat the preceding step for NO2 to show the critical months that account for most of the days with NO2 recorded values exceeding the 24-hourly standard. Doing this exercise would again help identify the months facing pollution levels of concern for both parameters, PM2.5 and NO2, and plan interventions accordingly.
b. Repeat the analysis carried out with data from the air quality monitoring station at NSIT with similar data from other stations, so that interventions for Delhi as a whole could be planned.

Case study 8-3: COVID-19 trend analysis

This dataset contains data about the geographic distribution of COVID-19 cases as of 29th June 2020 (Source: European Centre for Disease Prevention and Control, source URL: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide). Note that this link contains the latest data, but we have used the data as of 29th June (the link to the dataset is provided in the "Technical requirements" section at the beginning of the chapter).
Questions that we want to answer through our analysis include:
1. Which are the countries with the worst mortality rates, the maximum number of cases, and the most fatalities?
2. What is the monthly trend in the number of cases and fatalities since the start of the pandemic?
3. In some of the countries, lockdowns were imposed to help flatten the curve. Did this measure aid in reducing the number of cases?
Step 1: Importing the data and examining the characteristics of the dataset
Read the dataset with the pd.read_excel function and examine the first five rows (using the head method):

df=pd.read_excel('COVID-19-geographic-distribution-worldwide-2020-06-29.xlsx')
df.head()

Get information about the data type of each column and the number of non-null values in each column (using the info method):

df.info()

Output (truncated):
... 26498 non-null float64
continentExp 26562 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(4)
memory usage: 2.2+ MB

Get summary statistics for each column (using the describe method) and obtain the values of the count, min, max, standard deviation, and percentiles:

df.describe()

Step 2: Data wrangling
In this step, we will:
• Check whether the data types of the columns are accurately identified and, if not, change them: from the output of the info method, we see that all the data types of the columns have been correctly identified.
• Rename the columns if necessary: in the following code, we rename the columns of the DataFrame.

#changing the column names
df.columns=['date','day','month','year','cases','deaths','country','old_country_code','country_code','population','continent']

• Drop any unnecessary columns or rows: we see that the country code column is present twice (with two different names, 'old_country_code' and 'country_code') in the DataFrame, hence we remove one of the columns ('old_country_code'):

#removing the duplicate country code column
df=df.drop(['old_country_code'],axis=1)

Since the percentage of null values is less than 1%, we drop the rows containing null values:

#dropping the rows with null values
df1=df.dropna()

• Aggregate the data if it is in a disaggregated format: the data in this DataFrame is not in an aggregated format, and we convert it into that format using the groupby method. We can group either by country, by continent, or by date. Let us group by the name of the country.

#Aggregating the data by country name
df_by_country=df1.groupby('country')[['cases','deaths']].sum()

This gives a consolidated picture of the number of cases and deaths for each country. Let us add another column to this aggregated DataFrame - the mortality rate, which is the ratio of the number of deaths to the number of cases:

#adding a column for the mortality rate
df_by_country['mortality_rate']=df_by_country['deaths']/df_by_country['cases']

Step 3: Visualizing the data
In our first visualization in this case study, we use the aggregated data in the DataFrame df_by_country to display the top twenty countries by mortality rate (Figure 8-8).

#Sorting the values for the mortality rate in the descending order
plt.figure(figsize=(15,10))
ax=df_by_country['mortality_rate'].sort_values(ascending=False).head(20).plot(kind='bar')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches:
    ax.annotate(p.get_height().round(2),(p.get_x()+p.get_width()/2,p.get_height()),ha='center',va='bottom')
ax.set_xlabel("Country")
ax.set_ylabel("Mortality rate")
ax.set_title("Countries with highest mortality rates")

Output:

In the second visualization, we display the ten countries with the highest number of COVID-19 cases using a pie chart, as shown in Figure 8-9 (a pie chart depicting the share of the top ten countries by COVID-19 cases).
In the next visualization, we find out the five countries that have suffered the most in terms of loss of human life from the COVID-19 pandemic, using a bar chart (Figure 8-10):

#sorting the number of deaths in the descending order
plt.figure(figsize=(10,6))
ax=df_by_country['deaths'].sort_values(ascending=False).head(5).plot(kind='bar')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches:
    ax.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()),ha='center',va='bottom')
ax.set_title("Countries suffering the most fatalities from COVID-19")
ax.set_xlabel("Countries")
ax.set_ylabel("Number of deaths")

Output:

Now, we plot line graphs to see the month-wise trend in the number of COVID-19 cases and fatalities. To plot the line graphs, we first aggregate the data by month and then plot two line graphs side by side, showing the number of cases and deaths.
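A minimal sketch of this aggregation and plotting step (assuming the renamed columns above and the cleaned DataFrame df1):

#Aggregating cases and deaths by month
df_by_month=df1.groupby('month')[['cases','deaths']].sum().reset_index()
#Plotting two line graphs side by side: monthly cases and monthly deaths
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(12,4))
df_by_month.plot(kind='line',x='month',y='cases',ax=ax1)
ax1.set_title("Cases by month")
df_by_month.plot(kind='line',x='month',y='deaths',ax=ax2)
ax2.set_title("Deaths by month")
plt.show()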
Many countries imposed a lockdown to stem the tide of increasing cases and flatten the curve. We now look at four countries - India, the UK, Italy, and Germany - where lockdowns were imposed in March, to see whether this measure had the desired impact. First, we create DataFrame objects for each of these countries, with data aggregated month-wise, and then plot the month-wise number of cases for each country as line graphs in a 2x2 grid of subplots. For example, the Germany panel is created as follows (the other three countries, including the "Cases in Italy across months" panel, are plotted in the same way):

#(fig is the figure object created for the 2x2 grid; df_germany_grouped is the month-wise DataFrame for Germany)
ax4=fig.add_subplot(2,2,4)
df_germany_grouped.plot(kind='line',x='month',y='cases',ax=ax4)
ax4.set_title("Cases in Germany across months")

Output:

Step 4: Drawing inferences based on analysis and visualizations
• Number of cases: The United States had the highest number of cases.
• Number of deaths: The United States, Brazil, the UK, Italy, and France had the highest death tolls.
• Trends: The total number of cases has been increasing steadily, while the total number of fatalities (deaths) has shown a decrease after April.
• Impact of lockdown: We analyzed four countries - India, the UK, Germany, and Italy - where lockdowns were imposed in March. Except for India, all these countries experienced an overall decrease in cases after the lockdown was imposed. In the UK and Germany, the cases went up initially (during the early phase of the lockdown) but started decreasing after this spike.

Summary
• In this chapter, we looked at various case studies where we imported data from both structured and unstructured data sources. Pandas provides support for reading data from a wide variety of formats.
• The requests module has functions that enable us to send HTTP requests to web pages and store the content from the page in an object.
• A typical descriptive or exploratory data analysis starts with framing the questions that we want to answer through our analysis and figuring out how to import the data. After this, we get more information about the data - the meanings of the various columns, the units of measurement, the number of missing values, the data types of the different columns, and so on.
• Data wrangling, where we prepare, clean, and structure the data to make it suitable for analysis, is the crux of descriptive or exploratory data analysis. Typical activities include removing extraneous data, handling null values, renaming columns, aggregating data, and changing data types.
• Once the data is prepared and made suitable for analysis, we visualize it using libraries like Matplotlib, Seaborn, and Pandas, to gain insights that answer the questions we initially framed.

Question 1 (mini case study)
Consider the first table on the following web page: https://en.wikipedia.org/wiki/Climate_of_South_Africa. It contains data about the maximum and minimum temperatures (in degrees centigrade) in various cities in South Africa, during summers and winters.
• Use the appropriate method from the requests module to send a get request to this URL and store the data from the first table on this page in a Pandas DataFrame.
• Rename the columns as follows: 'City', 'Summer(max)', 'Summer(min)', 'Winter(max)', 'Winter(min)'.
• Replace the negative value in the first row of the 'Winter(min)' column with 0, and convert the data type of this column to int64.
• Plot a graph to display the hottest cities in South Africa during the summers (use the Summer(max) column).
• Plot a graph to display the coldest cities in South Africa during the winters (use the Winter(min) column).
One possible approach is sketched after this list.
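A minimal sketch of one way this exercise could be approached (it assumes that the first table returned by read_html has five columns in the order listed above and that the temperature values parse as numbers):

import requests
import pandas as pd
import matplotlib.pyplot as plt

#fetching the page and reading its tables into a list of DataFrames
url='https://en.wikipedia.org/wiki/Climate_of_South_Africa'
req=requests.get(url)
tables=pd.read_html(req.text)
#taking the first table and renaming its columns (the table layout is an assumption)
sa=tables[0]
sa.columns=['City','Summer(max)','Summer(min)','Winter(max)','Winter(min)']
#replacing the negative value in the first row of 'Winter(min)' with 0 and converting the column to int64
sa.loc[0,'Winter(min)']=0
sa['Winter(min)']=sa['Winter(min)'].astype('int64')
#hottest cities during the summers
sa.sort_values(by='Summer(max)',ascending=False).plot(kind='bar',x='City',y='Summer(max)')
#coldest cities during the winters
sa.sort_values(by='Winter(min)').plot(kind='bar',x='City',y='Winter(min)')
plt.show()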
The weekly wages of ten employees (with the initials A-J) are as follows: 100, 120, 80, 155, 222, 400, 199, 403, 345, 290. Store the weekly wages in a DataFrame.
• Plot a bar graph to display the wages in descending order.

CODE:

#storing the weekly wages in a Series
numbers=pd.Series([100,120,80,155,222,400,199,403,345,290])
#labelling the index with the employees' initials
numbers.index=list('ABCDEFGHIJ')
#converting the Series to a DataFrame and labelling the column
numbers=numbers.to_frame()
numbers.columns=['Wages']
#plotting the wages in descending order
ax=numbers['Wages'].sort_values(ascending=False).plot(kind='bar')
#labelling the bars
for p in ax.patches:
    ax.annotate(p.get_height(),(p.get_x()+p.get_width()/2,p.get_height()),ha='center',va='bottom')

Answers: 1-d; 2-c; 3-a; 4-b
Option 3: Content from each table on the web page is stored in a separate DataFrame object. The read_html Pandas function reads every table on an HTML page and returns them as a list of DataFrame objects, one per table.