Lights Out: An Analysis of Power Outages

Pranav Rajaram, Abhinav Chinnam

This is our final project for the DSC 80 course at UC San Diego.

Introduction

The dataset we analyzed in this project contains information about major power outages in the United States from 2000 to 2016. We thought this dataset would be valuable to look into because, according to census.gov, 33.9 million households are impacted by power outages in the US every year, so being able to predict and better understand power outages would have a clear real-world impact. With this in mind, the question we used as the framework for our analysis was: “What are the causes and effects of major power outages in the United States?”

The data we used came from the Purdue University Engineering Research Data site, and a data dictionary from the article “A Multi-Hazard Approach to Assess Severe Weather-Induced Major Power Outage Risks in the U.S.” was also provided. The dataset defines a “major” power outage as one that impacted at least 50,000 customers or caused an unplanned firm load loss of at least 300 MW. The data includes not only information about each outage itself (duration, customers affected, etc.), but also details about regional electricity consumption patterns, economic characteristics, and climate conditions in the impacted areas.

The original dataset had 1,534 rows. The columns we thought to be relevant to our analysis were: YEAR, MONTH, U.S._STATE, CLIMATE.REGION, CLIMATE.CATEGORY, CAUSE.CATEGORY, OUTAGE.DURATION, DEMAND.LOSS.MW, CUSTOMERS.AFFECTED, OUTAGE.START.DATE, OUTAGE.START.TIME, OUTAGE.RESTORATION.DATE, OUTAGE.RESTORATION.TIME, RES.PRICE, RES.SALES, RES.PERCENT, RES.CUST.PERCENT, RES.CUSTOMERS, POPPCT_URBAN, POPDEN_URBAN, POPULATION, and TOTAL.CUSTOMERS.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

Here are the steps we took to clean the data for analysis:

  1. We manually exported the data from a .xlsx file to a .csv file and loaded it into Python.
  2. We changed all of the column names to lower case and replaced periods with underscores. For instance, ‘OUTAGE.DURATION’ became ‘outage_duration’. This made it more convenient to access the columns in the future.
  3. We combined OUTAGE.START.DATE and OUTAGE.START.TIME into one pd.Timestamp column called OUTAGE.START. We did the same for OUTAGE.RESTORATION.DATE and OUTAGE.RESTORATION.TIME.
  4. We dropped all of the columns not listed above, and set the index of the DataFrame to the OBS column.
  5. We replaced values of 0 in the CUSTOMERS.AFFECTED, OUTAGE.DURATION, and DEMAND.LOSS.MW columns with NA. We did this because we thought values of 0 in those columns indicated missing values, as it does not make much sense for 0 customers to be affected by a major outage, for the duration of an outage to be 0 minutes, or for the total loss of demand to be 0 megawatts. (A rough sketch of these cleaning steps is shown below.)
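
For concreteness, here is a minimal sketch of what these cleaning steps could look like in pandas. The file name, the exact column formats, and details like `errors='coerce'` are assumptions rather than our exact code.

```python
import numpy as np
import pandas as pd

# Step 1: load the manually exported CSV (file name is a placeholder).
outages = pd.read_csv('outage.csv')

# Step 2: lower-case column names and swap periods for underscores,
# e.g. 'OUTAGE.DURATION' -> 'outage_duration'.
outages.columns = outages.columns.str.lower().str.replace('.', '_', regex=False)

# Step 3: combine the date and time columns into single Timestamp columns.
outages['outage_start'] = pd.to_datetime(
    outages['outage_start_date'].astype(str) + ' ' + outages['outage_start_time'].astype(str),
    errors='coerce',
)
outages['outage_restoration'] = pd.to_datetime(
    outages['outage_restoration_date'].astype(str) + ' ' + outages['outage_restoration_time'].astype(str),
    errors='coerce',
)

# Step 4: keep only the relevant columns (list elided here) and index by observation number.
outages = outages.set_index('obs')

# Step 5: treat 0 as missing in the columns where a value of 0 is not meaningful.
cols_with_fake_zeros = ['customers_affected', 'outage_duration', 'demand_loss_mw']
outages[cols_with_fake_zeros] = outages[cols_with_fake_zeros].replace(0, np.nan)
```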

This is a snapshot of what our DataFrame looked like after these cleaning steps. For the sake of appearance, the RES.PRICE, RES.SALES, RES.PERCENT, RES.CUST.PERCENT, RES.CUSTOMERS, POPPCT_URBAN, POPDEN_URBAN, POPULATION, and TOTAL.CUSTOMERS columns are not included in the table below.

| year | month | u_s__state | climate_region | climate_category | cause_category | outage_duration (minutes) | demand_loss_mw | customers_affected | outage_start | outage_restoration |
|------|-------|------------|----------------|------------------|----------------|---------------------------|----------------|--------------------|--------------|--------------------|
| 2011 | 7  | Minnesota | East North Central | normal | severe weather     | 3060 | nan | 70000  | 2011-07-01 17:00:00 | 2011-07-03 20:00:00 |
| 2014 | 5  | Minnesota | East North Central | normal | intentional attack | 1    | nan | nan    | 2014-05-11 18:38:00 | 2014-05-11 18:39:00 |
| 2010 | 10 | Minnesota | East North Central | cold   | severe weather     | 3000 | nan | 70000  | 2010-10-26 20:00:00 | 2010-10-28 22:00:00 |
| 2012 | 6  | Minnesota | East North Central | normal | severe weather     | 2550 | nan | 68200  | 2012-06-19 04:30:00 | 2012-06-20 23:00:00 |
| 2015 | 7  | Minnesota | East North Central | warm   | severe weather     | 1740 | 250 | 250000 | 2015-07-18 02:00:00 | 2015-07-19 07:00:00 |

Univariate Analysis

This bar graph shows the different causes of major power outages in the dataset. We can see that severe weather is the most frequent cause, while intentional attacks and system malfunctions are also common culprits. This is helpful because it gives us a better sense of when to be mindful of a potential outage: a household has far more reason to prepare for a power outage when a hurricane is in the area than when the only risk is something like islanding.

Bivariate Analysis

This plot shows the number of customers affected by major power outages in each month of the dataset. This information is valuable because it helps us understand when to be most wary of a power outage. Summer and winter months appear to have the highest numbers of affected customers, which makes sense given what we learned about severe weather. This graph also sets up our later hypothesis test.

Interesting Aggregates

| u_s__state | year | outage_duration |
|------------|------|-----------------|
| Oklahoma   | 2002 | 1.88113e+06 |
| Florida    | 2005 | 1.14378e+06 |
| New York   | 2003 | 1.11188e+06 |
| California | 2014 | 933475 |
| California | 2002 | 769750 |
| Texas      | 2008 | 750607 |
| Ohio       | 2003 | 668750 |
| California | 2005 | 613378 |
| New York   | 2012 | 607272 |
| Virginia   | 2001 | 600000 |

This table shows the state and year with the 10 highest average power outage durations. It is a good sanity check of the data, as we can try to match up the results of this aggregation with real life events. For example, Hurricane Katrina would have impacted millions of Floridians in 2005, while the Northeast blackout of 2003 likely caused the large outage duration in New York state in 2003.
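
As a sketch (not necessarily our exact code), an aggregation like this can be produced with a groupby; it assumes the cleaned `outages` DataFrame from the data cleaning section.

```python
# Average outage duration for each (state, year) pair, ten largest values.
(
    outages
    .groupby(['u_s__state', 'year'])['outage_duration']
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
```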

This heatmap, which is an interactive version of a pivot table, shows the number of customers affected by major power outages per climate region and year. As expected, areas of the United States prone to severe weather events (Southeast, Northeast, West) tend to have larger numbers of affected customers. We can also match up well-known events to this plot, such as Hurricane Sandy impacting the Northeast in 2012.
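
The underlying pivot table could be built roughly as follows. The aggregation function (sum) and the plotting call are assumptions, since only the rendered heatmap is described above.

```python
import plotly.express as px

# Customers affected per climate region and year (aggregating by sum here).
pivot = outages.pivot_table(
    index='climate_region',
    columns='year',
    values='customers_affected',
    aggfunc='sum',
)
fig = px.imshow(pivot, labels={'color': 'customers affected'})
fig.show()
```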

Assessment of Missingness

NMAR Analysis

One column in the dataset that we believe could be Not Missing at Random (NMAR) is OUTAGE.DURATION, meaning the missingness of the column depends on the values themselves. Energy provider companies might avoid reporting the duration of extremely long power outages because they are concerned about backlash and criticism from the public. This would mean that higher values of OUTAGE.DURATION are more likely to be missing than lower ones, making the column not missing at random.

An additional piece of data that would help explain the missingness is the main energy provider for each power outage. If we had that information, we could try to find patterns between the missing values and energy providers, such as if one particular company consistently had missing duration values for major outages.

Missingness Dependency

In this section, we looked for a column on which the missingness of the DEMAND.LOSS.MW column depends (i.e., a column it is Missing At Random, or MAR, with respect to), as well as a column on which it does not depend. We did this with the larger goal of finding ways to address the missingness in DEMAND.LOSS.MW, such as probabilistic imputation. DEMAND.LOSS.MW is the column with the highest number of missing values: 901 of its 1,534 entries are NA.

We found that the missingness of DEMAND.LOSS.MW DOES depend on the POPPCT.URBAN column. To determine this, we conducted a permutation test with the difference of group means as the test statistic, where the two groups are the rows where DEMAND.LOSS.MW is missing and the rows where it is not. We used the following pair of hypotheses for the test:

Null hypothesis: The distribution of POPPCT.URBAN when DEMAND.LOSS.MW is missing is the same as the distribution of POPPCT.URBAN when DEMAND.LOSS.MW is not missing.

Alternative hypothesis: The distribution of POPPCT.URBAN when DEMAND.LOSS.MW is missing is different from the distribution of POPPCT.URBAN when DEMAND.LOSS.MW is not missing.

We shuffled the POPPCT.URBAN column 1,000 times to generate a distribution of differences under the null hypothesis, and compared that to the observed difference of means.
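
A minimal sketch of this kind of missingness permutation test is below. It shuffles the missingness labels rather than the POPPCT.URBAN values, which yields an equivalent test; the function name and the one-sided p-value convention are assumptions.

```python
import numpy as np

def missingness_perm_test(df, value_col, missing_col, n_reps=1000):
    """Permutation test: does the distribution of value_col differ depending
    on whether missing_col is missing? Uses difference of group means."""
    is_missing = df[missing_col].isna().to_numpy()

    def diff_of_means(labels):
        return df.loc[labels, value_col].mean() - df.loc[~labels, value_col].mean()

    observed = diff_of_means(is_missing)
    simulated = np.array([
        diff_of_means(np.random.permutation(is_missing)) for _ in range(n_reps)
    ])

    # One-sided p-value: fraction of simulated differences at least as large as observed.
    return observed, np.mean(simulated >= observed)

obs_diff, p_val = missingness_perm_test(outages, 'poppct_urban', 'demand_loss_mw')
```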

This plot shows that distribution. The observed difference was greater than almost all of the simulated differences, and the p-value was 0.001. Therefore, at a 1% significance level, we reject the null hypothesis that the distribution of POPPCT.URBAN when DEMAND.LOSS.MW is missing is the same as its distribution when DEMAND.LOSS.MW is not missing. We conclude that the demand loss of a power outage is MAR, dependent on the urban population percentage of the state where the outage occurred.

We also found that the missingness of DEMAND.LOSS.MW does NOT depend on the CLIMATE.CATEGORY column. To determine this, we conducted a permutation test with the total variation distance (TVD) as the test statistic, since CLIMATE.CATEGORY is categorical. We used the following pair of hypotheses for the test:

Null hypothesis: The distribution of CLIMATE.CATEGORY when DEMAND.LOSS.MW is missing is the same as the distribution of CLIMATE.CATEGORY when DEMAND.LOSS.MW is not missing.

Alternative hypothesis: The two distributions are different.

We shuffled CLIMATE.CATEGORY 1,000 times to generate a distribution of TVDs under the null hypothesis, and compared that to the observed TVD. The resulting p-value was 0.508, so we fail to reject the null hypothesis that the distribution of CLIMATE.CATEGORY when DEMAND.LOSS.MW is missing is the same as its distribution when DEMAND.LOSS.MW is not missing.

Hypothesis Testing

In one of the graphs from the exploratory data analysis section, we saw that the number of customers impacted by power outages seemed to differ from month to month. We wanted to explore this idea further, because determining whether the month has an effect on power outages would be quite valuable in the real world. It would also help address the “causes” half of our original analysis question. So, we conducted a hypothesis test with the following set of hypotheses:

Null hypothesis: Customers affected by power outages are distributed uniformly across the twelve months of the year; any differences between months in our data are due to random chance.

Alternative hypothesis: Customers affected by power outages are not distributed uniformly across the months of the year.

We will use TVD as the test statistic here as well, since month is a categorical column. Our significance level is 1%.

We first calculated the observed proportion of customers affected by power outages per month. That distribution is visualized below, as well as the ‘Null Proportion’. In this case, the null proportion is 1/12 for every month, as there are 12 months in the year and the null hypothesis states that power outages are equally likely to occur in all of them.

We calculated the observed TVD of these 12 groups, which was about 0.15. Then, we simulated 1,000 draws from the null distribution, calculating the TVD for each one. The resulting distribution and the observed TVD are shown below.
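
Below is a rough sketch of how such a simulation could be set up. Exactly what counts as one “draw” from the null (outages vs. individual customers) is an assumption here, as is the helper function name.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two categorical distributions."""
    return np.abs(p - q).sum() / 2

# Observed distribution: share of affected customers in each month.
observed_dist = (
    outages.groupby('month')['customers_affected'].sum()
    / outages['customers_affected'].sum()
).to_numpy()
null_dist = np.full(12, 1 / 12)
observed_tvd = tvd(observed_dist, null_dist)

# Simulate 1,000 datasets in which outages land in each month with equal probability.
n = outages['month'].notna().sum()
simulated_tvds = np.array([
    tvd(np.random.multinomial(n, null_dist) / n, null_dist) for _ in range(1000)
])

p_value = np.mean(simulated_tvds >= observed_tvd)
```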

As we can see, none of the 1,000 simulated TVDs under the null is as large as the observed one, so the estimated p-value is 0.0 (i.e., less than 1/1,000). We therefore reject the null hypothesis at a significance level of 1%: there is statistically significant evidence that the differences in affected customers between months of the year cannot be attributed solely to sampling variation.

Framing a Prediction Problem

Our prediction problem is to predict the number of customers impacted by a power outage. Our response variable is the CUSTOMERS.AFFECTED column, and because it is continuous, we use regression rather than classification. We chose this column as our target because it captures a meaningful and easily understandable impact of a power outage: companies and individuals alike could learn how many customers are expected to be affected by an incoming outage and make the necessary accommodations.

One important thing to consider is that we only want to use information available before the outage occurs. While a feature like total demand lost would no doubt be indicative of the number of impacted customers, we would only know its actual value after the outage has occurred, making the model effectively useless in practice. The information we would know at the time of prediction includes regional characteristics (climate, population density), general customer energy usage (electricity spending, electricity consumption), and the month/day/time of the prediction.

In order to evaluate our model, we will use Mean Absolute Error (MAE). MAE tells us how far off the model's predictions were on average, and it has an easily interpretable unit, which in this case is a number of customers. We chose MAE over metrics like Root Mean Squared Error (RMSE) or the correlation coefficient between predicted and observed values because MAE preserves the original units while not disproportionately weighting larger errors.
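
For reference, MAE is simply the average absolute prediction error, where $y_i$ is the observed number of affected customers for outage $i$ and $\hat{y}_i$ is the model's prediction:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$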

Baseline Model

Our baseline model was a multiple linear regression model with two features. Here is a breakdown of each feature:

  1. POPPCT_URBAN, the percentage of the state's population living in urban areas. We binarized this column (it appears as poppct_urban_binarized in the design matrix below).
  2. CLIMATE.REGION, the climate region in which the outage occurred. Since this is a nominal categorical column, we one-hot encoded it.

Here is a look at what our initial design matrix looked like (a few of the columns were dropped for the sake of appearance):

| poppct_urban_binarized | climate_region_Central | climate_region_East North Central | climate_region_Northeast | climate_region_Northwest | climate_region_South |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |

We used a 75/25 train/test split on the data and got promising but far from perfect results. The mean absolute error (MAE) between the predicted and actual values was 148,896, which means the model was, on average, able to predict the number of customers impacted by a power outage to within around 150,000. We think this is a solid baseline, considering that values in the original CUSTOMERS.AFFECTED column range from 30,000 to 3,000,000. However, we believed we could sharpen the model by experimenting with more potential features and using tools like cross-validation to avoid over- and underfitting and to make better use of the training and testing data.
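
A rough sketch of a baseline pipeline like the one described above is shown below. The binarization threshold for POPPCT_URBAN, the handling of missing rows, and the random seed are assumptions, since they are not stated above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, OneHotEncoder

# Keep only rows usable for supervised training.
model_df = outages.dropna(subset=['customers_affected', 'poppct_urban', 'climate_region'])
X = model_df[['poppct_urban', 'climate_region']]
y = model_df['customers_affected']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

preprocess = ColumnTransformer([
    # The threshold of 80% urban population is an assumed cutoff for illustration.
    ('urban', Binarizer(threshold=80), ['poppct_urban']),
    ('region', OneHotEncoder(handle_unknown='ignore'), ['climate_region']),
])

baseline = Pipeline([
    ('preprocess', preprocess),
    ('regressor', LinearRegression()),
])

baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
```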

Final Model

For our final model, we added three features to the baseline. They are as follows:

We chose a Random Forest Regressor for the final model instead of linear regression. We thought the Random Forest model would be better because it handles nonlinear relationships between the features and the target variable. Additionally, a Random Forest is more robust to outliers, which certainly exist in our data given the wide range of the target column. It is also better at avoiding overfitting to the training data and handles potential multicollinearity between features.

Now that we were using a Random Forest model, we could also tune hyperparameters to further optimize the predictions and reduce overfitting. We tuned the following two hyperparameters:

  1. n_estimators, the number of trees in the forest.
  2. max_depth, the maximum depth of each tree.

We used 5-fold cross-validation with sklearn's GridSearchCV to determine the optimal hyperparameters. The best combination was 300 estimators and a maximum tree depth of 5.
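
A sketch of this grid search, reusing the `preprocess` transformer and train/test split from the baseline sketch, might look like the following; the candidate hyperparameter values in the grid are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

final_pipeline = Pipeline([
    ('preprocess', preprocess),
    ('regressor', RandomForestRegressor(random_state=42)),
])

param_grid = {
    'regressor__n_estimators': [100, 200, 300],
    'regressor__max_depth': [3, 5, 10, None],
}

# 5-fold cross-validation, scoring by (negative) MAE to match our evaluation metric.
search = GridSearchCV(final_pipeline, param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X_train, y_train)
search.best_params_  # reported best: 300 estimators, max depth of 5
```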

Our final model improved over the baseline, with a new mean absolute error of 138,113, an improvement in average error of more than 10,000 customers. Here is a histogram that plots the actual numbers of customers affected alongside the predicted values.

As we can see, the model is fairly accurate at predicting customers affected between 0 and 500,000, but loses accuracy as the number of affected customers gets higher. In fact, if we filter out the 19 rows where the observed number of customers affected was greater than 1,000,000, the MAE improves to 97,061. Overall, we think the final model performs well for the majority of outages and provides valuable insights, although further work could certainly be done to handle extreme cases more effectively.

Fairness Analysis

We wanted to assess if our model’s predictions were “fair” across different groups. That is, for two groups of the data, would the model’s predictions have a similar amount of error?

The two groups we tested were the following:

  1. Outages caused by severe weather (is_weather is True).
  2. Outages caused by something other than severe weather (is_weather is False).

We used the following set of hypotheses for the permutation test:

Null hypothesis: Our model is fair. The root mean squared error (RMSE) of its predictions for severe-weather outages and for non-severe-weather outages is roughly the same, and any differences are due to random chance.

Alternative hypothesis: Our model is unfair. The RMSE of its predictions differs between the two groups.

Our test statistic is the absolute difference in RMSE between the two groups. Larger values of this statistic indicate that the model is less fair, since its prediction error would differ substantially between the groups. We use the absolute difference because the test is two-sided.

We shuffled the is_weather column 1,000 times to generate a distribution of differences under the null hypothesis, and compared that to the observed absolute difference of RMSEs.
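
A minimal sketch of this fairness permutation test is below. It assumes a DataFrame `test_results` with the test-set targets in `actual`, the model's predictions in `predicted`, and the boolean `is_weather` group label; those column names are placeholders.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5

def fairness_perm_test(test_results, n_reps=1000):
    """Permutation test on the absolute difference in RMSE between the two groups."""
    def abs_rmse_diff(labels):
        weather, other = test_results[labels], test_results[~labels]
        return abs(
            rmse(weather['actual'], weather['predicted'])
            - rmse(other['actual'], other['predicted'])
        )

    labels = test_results['is_weather'].to_numpy()
    observed = abs_rmse_diff(labels)
    simulated = np.array([
        abs_rmse_diff(np.random.permutation(labels)) for _ in range(n_reps)
    ])
    return observed, np.mean(simulated >= observed)
```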

Our p-value was 0.6. As shown in the plot, this means there is roughly a 60% chance of seeing an absolute difference in RMSE as extreme as or more extreme than our observed value under the null hypothesis. So, we fail to reject the null hypothesis at a 1% significance level, meaning there is no statistically significant evidence that the model's prediction error differs between the two groups.