⚡️Hacking the Power System. An End-to-End Machine Learning Project. Part 3: Machine Learning Modeling💡

Photo: energepic.com, https://www.pexels.com/@energepic-com-27411

If you want to see this ML Model deployed, please go to:

With this article series, I'll show you how I developed a truly end-to-end Data Science and Machine Learning project.

The project’s main goal is to analyze and model the technical, economic, and environmental behavior of the Colombian electric power system in an end-to-end fashion.

In this third part, I will show you how I designed several Machine Learning experiments to find the best algorithm to model these time series data.

In red, the steps we will cover in this article.

Here I will explain the steps I executed, with some code of course, but I won't show all of it. For the full code, just go to the repository:

GitHub - jasantosm/Emisiones-CO2-Sistema-Electrico-Machine-Learning

If you are looking for the first or second part:

⚡️Hacking the Power System. An End-to-End Machine Learning Project. Part 1: Scope and Data Extraction💡
⚡️Hacking the Power System. An End-to-End Machine Learning Project. Part 2: Exploratory Data Analysis💡

All images were created by the author unless explicitly stated otherwise.

Machine Learning Goals

  • Understand the general behavior of polluting gas emissions from the national interconnected system.
  • Determine the correlation of the amount of emissions to other variables of the electricity generation system.
  • Understand the influence of climatic phenomena on SIN emissions.
  • Find the influence of fuel consumption on the price of electricity.
  • Create a Machine Learning Model to predict the CO2 Equivalent emissions of the electrical system.

Tools

The following Libraries were used in this project:

  • pandas
  • matplotlib
  • seaborn
  • scikit-learn

General Definitions

  • National Interconnected System (SIN): The Colombian power system, interconnected across the country and serving most of it.
  • XM S.A. E.S.P.: A company of the ISA group in charge of administering the electricity market in Colombia. It also operates the National Interconnected System (SIN), integrating its generation, interconnection, and transmission resources.
  • Polluting Emissions: The discharge of gaseous fluids, pure or with suspended substances, that are residues of human activity or nature and cause adverse effects on the Earth's atmosphere, such as the greenhouse effect or acid rain.
  • 𝐶𝑂2: Carbon dioxide, an atmospheric polluting gas that produces a greenhouse effect.
  • 𝐶𝐻4: Methane, a polluting gas in the atmosphere, which produces a greenhouse effect with a global warming potential 21 times greater than 𝐶𝑂2.
  • 𝑁2𝑂: Nitrous Oxide, a polluting gas in the atmosphere, which produces a greenhouse effect with a global warming potential 310 times greater than 𝐶𝑂2.
  • 𝐶𝑂2𝑒𝑞: Emissions of a gas expressed as the equivalent amount of 𝐶𝑂2. It is calculated with the formula Ton 𝐶𝑂2𝑒𝑞 = (Ton Gas) × (GWP), where GWP is the global warming potential of the gas, as illustrated below.
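
To make the formula concrete, here is a tiny, purely illustrative calculation using the GWP values listed above (the tonnages are made up for the example):

# Hypothetical daily tonnages, used only to illustrate the formula above
GWP = {'CO2': 1, 'CH4': 21, 'N2O': 310}          # global warming potentials
tons = {'CO2': 25000.0, 'CH4': 2.0, 'N2O': 0.5}  # example emissions in tons

# Ton CO2eq = (Ton Gas) x (GWP), summed over all gases
co2_eq = sum(tons[gas] * GWP[gas] for gas in tons)
print(co2_eq)  # 25000 + 2*21 + 0.5*310 = 25197.0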

Electrical System Variables

  • Generation: It is the amount of electrical energy generated by the SIN.
  • Demand: It is the amount of electrical energy demanded by the consumers of the SIN.
  • Actual Availability: The installed capacity that is ready to generate, i.e., assets capable of starting to generate energy immediately.
  • Energy Contributions: The water contributions from rivers and rain to the SIN's reservoirs. Instead of flow, they are expressed in energy units, which directly states how much electricity that amount of water can generate.
  • Useful Volume: The volume of water held in the SIN's reservoirs that can still generate energy. Like the contributions, it is measured in energy units.
  • Fuel Consumption: It is the amount of fossil fuel used by the SIN.
  • Exchange Price: The price of the wholesale energy exchange of the SIN.
  • Emissions: For each polluting gas, the quantity in tons that the system discharges into the atmosphere. It only includes gases generated directly by plants burning fuel to generate electricity.

The Data

The data were extracted using the following resources:

  • Web Scraper, see part 1.
  • XM S.A. E.S.P. REST API

import pandas as pd

df_power_co = pd.read_csv('./data/clean_dataset/power_system.csv')

You can download the dataset from here: https://github.com/jasantosm/Emisiones-CO2-Sistema-Electrico-Machine-Learning/blob/master/data/clean_dataset/power_system.csv

>>> df_power_co.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2653 entries, 0 to 2652
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  2653 non-null   int64  
 1   Date                        2653 non-null   object 
 2   daily_generacion            2653 non-null   float64
 3   daily_consumo_combustible   2653 non-null   float64
 4   daily_demanda               2653 non-null   float64
 5   daily_aportes_energia       2653 non-null   float64
 6   daily_volumen_util_energia  2653 non-null   float64
 7   daily_disponibilidad_real   2653 non-null   float64
 8   daily_precio_bolsa          2653 non-null   float64
 9   daily_emision_CO2_eq        2653 non-null   float64
dtypes: float64(8), int64(1), object(1)
memory usage: 207.4+ KB

Dataset time window:

>>> df_power_co['Date'].head(1)
0    2013-10-01
Name: Date, dtype: object

>>> df_power_co['Date'].tail(1)
2652    2021-01-04
Name: Date, dtype: object

The dataset time window starts on October 1, 2013, and ends on January 4, 2021.

Units and conversion

The following table shows the original measurement units of the dataset and the units to which the conversion was made, in order to handle the magnitudes more conveniently.

Variable (Feature)           Unit      Converted to   Meaning
daily_generacion             kWh       GWh            Generation
daily_emision_CO2            Ton       Ton            CO2 Emissions
daily_emision_CH4            Ton       Ton            CH4 Emissions
daily_emision_N2O            Ton       Ton            N2O Emissions
daily_consumo_combustible    MBTU      MMBTU          Fuel consumption
daily_demanda                kWh       GWh            Demand
daily_aportes_energia        kWh       GWh            Energy Contributions
daily_volumen_util_energia   kWh       GWh            Useful Volume
daily_disponibilidad_real    MW        GWh*           Actual Availability
daily_precio_bolsa           COP/kWh   COP/kWh        Market Price

*Actual availability is given in MW of power available to generate electricity; however, this unit is not compatible with the energy units used by the other variables. For this reason, the average actual energy availability per day is calculated by multiplying the power by the 24 hours of the day.
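
As a minimal sketch of what these conversions look like in pandas (the numbers below are hypothetical, and the clean dataset used in this article is presumably already converted):

import pandas as pd

# Hypothetical raw values, used only to illustrate the conversion arithmetic
df_raw = pd.DataFrame({
    'daily_generacion': [180_000_000.0],        # kWh
    'daily_consumo_combustible': [250_000.0],   # MBTU
    'daily_disponibilidad_real': [14_000.0],    # MW
})

df_raw['daily_generacion'] /= 1e6                # kWh -> GWh
df_raw['daily_consumo_combustible'] /= 1e3       # MBTU -> MMBTU (assuming MBTU = thousand BTU)
df_raw['daily_disponibilidad_real'] *= 24 / 1e3  # MW -> average GWh per day (x 24 hours, then / 1000)
print(df_raw)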

Experimentation Plan

For this project I am going to evaluate the following scikit-learn models:

  1. Multilayer Perceptron - MLPRegressor
  2. Random Forests - RandomForestRegressor
  3. Support Vector Machines - SVR
  4. Random Forests with gradient boosting - XGBoost

These models were selected for their ability to handle nonlinear problems.

The metrics with which the models will be evaluated are:

  • mean_absolute_error
  • mean_squared_error
  • r2_score

Also, I will use forward-chaining cross-validation for time series.
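
As a minimal, self-contained illustration of these three scikit-learn metrics (the y_true and y_pred values below are made up):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Purely illustrative true and predicted values
y_true = [0.2, -0.1, 1.3, 0.8]
y_pred = [0.1,  0.0, 1.1, 0.9]

print('MAE:', mean_absolute_error(y_true, y_pred))
print('MSE:', mean_squared_error(y_true, y_pred))
print('R2 :', r2_score(y_true, y_pred))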

A Look at the CO2 Equivalent Variable

We want to forecast 𝐶𝑂2𝑒𝑞, the total CO2-equivalent emissions (with CH4 and N2O converted through their GWP), so let's take a look at it:

>>> df_CO2_ts = df_power_co[['Date','daily_emision_CO2_eq']].set_index('Date')
>>> df_CO2_ts.describe()
       daily_emision_CO2_eq
count           2653.000000
mean           30415.946459
std            12885.491676
min             6734.720000
25%            20722.350000
50%            28767.140000
75%            37956.930000
max            71519.160000
>>> df_CO2_ts.plot(rot=90, figsize = (12, 5), fontsize = 13.5);

The CO2 Equivalent values are too big for a Machine Learning model, so it's a good idea to apply a standardization method to reduce their scale.

>>> from sklearn.preprocessing import StandardScaler

>>> scaler = StandardScaler()

>>> df_CO2_ts['daily_emision_CO2_eq'] = scaler.fit_transform(df_CO2_ts['daily_emision_CO2_eq'].values.reshape(-1, 1))

>>> df_CO2_ts.plot(rot=90, figsize = (12, 5), fontsize = 13.5);

Time Series Cross Validation

A time series problem can be transformed into a supervised learning problem by creating a window of the k previous observations as predictors to estimate the next value (k+1).

For the problem I will use the Forward Chaining approach:

Image taken from: https://towardsdatascience.com/time-series-machine-learning-regression-framework-9ea33929009a

I recommend reading the following article for a complete explanation of a Time Series Machine Learning Regression Framework: https://towardsdatascience.com/time-series-machine-learning-regression-framework-9ea33929009a

With the following code I adapted the dataset for the ML Models.

For training data:

>>> import numpy as np

>>> def sliding_time(ts, window_size=1):
      # Build a supervised dataset: each row of X contains the window_size
      # previous observations, and y holds the value that follows that window.
      n = ts.shape[0] - window_size
      X = np.empty((n, window_size))
      y = np.empty(n)

      for i in range(window_size, ts.shape[0]):
        y[i - window_size] = ts[i]
        X[i - window_size, :] = ts[i - window_size:i]

      return X, y
      
>>> data_train = df_CO2_ts['daily_emision_CO2_eq'].loc[:'2018-10-01']  #First 5 years
>>> data_test  = df_CO2_ts['daily_emision_CO2_eq'].loc['2018-10-02':]  #Last 2 years

>>> # Creation of the time windows for training
>>> k=20
>>> X_train, y_train = sliding_time(data_train.values, window_size=k) 

>>> print(f"Number of examples for training: {X_train.shape[0]} (Window size: {X_train.shape[1]})")
Number of examples for training: 1807 (Window size: 20)

>>> print(f"Number of values to predict: {y_train.shape[0]}")
Number of values to predict: 1807

For validation data:

>>> X_test, y_test = sliding_time(data_test.values, window_size=k)

>>> print(f"Number of examples for training: {X_test.shape[0]} (Window size: {X_test.shape[1]})")
Number of examples for training: 806 (Window size: 20)

>>> print(f"Número de valores a predecir: {y_test.shape[0]}")
Number of values to predict: 806

The X_train dataset looks something like:
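
One quick way to inspect those windows (just a convenience view, not part of the modeling code) is to wrap X_train in a DataFrame where each column is one of the k lagged values:

import pandas as pd

# Column t-20 is the oldest observation of each window, t-1 the most recent one
pd.DataFrame(X_train, columns=[f't-{k - j}' for j in range(k)]).head()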

Finally, I will create the validation partitions using the Forward Chaining approach:

# Time series data split selection
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit

n_splits = 6

# The split returns the train and test indexes.
tsp = TimeSeriesSplit(n_splits=n_splits)

i = 0
fig = plt.figure(figsize = (12,6), dpi = 110)
plt.set_cmap('Paired')

tsp_indexes = [(train_index, test_index) for (train_index, test_index) in tsp.split(X_train, y_train)]

for train_index, test_index in tsp_indexes:
  plt.plot(train_index,
           np.full(len(train_index), 1-i*0.001),
           lw = 8,
           ls= '-.',
           label = f'Training (k = {i + 1})')

  plt.plot(test_index,
           np.full(len(test_index), 1-i*0.001),
           lw = 8,
           ls= '-',
           label = f'Validation (k = {i + 1})')
  i += 1
fig.get_axes()[0].get_yaxis().set_visible(False)
plt.legend(ncol=1, title = 'Indexes per split');
plt.title('Training and validation splits')
plt.show()

The next chart shows the CO2 eq emissions time series for the k=6 splits: blue is the training set, orange the validation set:

Multiple Features

So far we have only looked at the variable we want to predict, the CO2 Equivalent emissions of the Colombian power system. However, as you read above, the system has more variables, or features, that could help us build a better model.

So let's take a look at them using the correlation matrix plotted as a heat map:

import seaborn as sns

corr_matrix = df_power_co.corr(method='spearman')
sns.heatmap(corr_matrix, vmin=-1, vmax=1);

From these correlations we can conclude that the following variables have a direct relationship with the 𝐶𝑂2𝑒𝑞 emissions:

  • Fuel consumption (0.968)
  • Market price (0.893)
  • Demand (0.072)
  • Generation (0.049)

And the following variables have an inverse relationship with emissions 𝐶𝑂2𝑒𝑞:

  • Energy contributions (-0.528)
  • Useful energy volume (-0.457)
  • Actual availability (-0.384)

These relationships are shown below:

sns.pairplot(df_power_co,y_vars=['daily_emision_CO2_eq']);

These variables must get the same processing we did for the CO2 emissions: apply the standard scaler and divide into train and validation sets using the Forward Chaining approach, ending up with something like this:
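
A minimal sketch of what that processing could look like, assuming the same column names and split dates used above (the exact feature set and pipeline in the repository may differ):

from sklearn.preprocessing import StandardScaler

# Illustrative feature selection based on the correlations above
feature_cols = ['daily_consumo_combustible', 'daily_precio_bolsa',
                'daily_aportes_energia', 'daily_volumen_util_energia',
                'daily_disponibilidad_real']

df_features = df_power_co[['Date'] + feature_cols].set_index('Date')
df_features[feature_cols] = StandardScaler().fit_transform(df_features[feature_cols])

# Same date split as the target series: training up to 2018-10-01, validation afterwards
features_train = df_features.loc[:'2018-10-01']
features_test  = df_features.loc['2018-10-02':]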

Model Training and Selection

After all the data prep we finally arrive at the fun part: model training and selection.

We will test the following algorithms:

  1. Multilayer Perceptron - MLPRegressor (MLP)
  2. Random Forests - RandomForestRegressor (RFR)
  3. Support Vector Machines - SVR (SVM)
  4. Random Forests with gradient boosting - XGBoost (XGB)

For each model we will try several hyperparameters using GridSearchCV.

The 10 best hyperparameter sets of every model will be stored in a pandas DataFrame, so at the end of the experimentation we will have 40 models. From those we select the best one by sorting the DataFrame by mean_test_score, which is the result reported by GridSearchCV. GridSearchCV uses the R2 coefficient to calculate the scores.

In order not to fill the article with repetitive code, I will show you how I implemented this strategy with one model, Support Vector Machines (SVR); for the other ones you just do the same.

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

params_SVR = {'C': [1, 10, 100, 1000],
              'kernel': ['rbf','poly'],
              'gamma': [0.01, 0.001, 0.0001],
              'degree': [2, 3, 4, 5, 6, 8],
              'epsilon': [0.1, 0.2]}

tsp = TimeSeriesSplit(n_splits = 3)

gsearch = GridSearchCV(estimator = SVR(), 
                        cv = tsp,
                        param_grid = params_SVR, 
                        verbose = 3)

gsearch.fit(X_train, y_train)
Fitting 3 folds for each of 288 candidates, totalling 864 fits
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf ..............
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf, score=-0.299, total=   0.1s
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf ..............
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf, score=-3.389, total=   0.3s
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf ..............
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=rbf, score=0.715, total=   0.5s
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly .............
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly, score=0.658, total=   0.1s
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly .............
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly, score=-6.252, total=   0.3s
[CV] C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly .............
[CV]  C=1, degree=2, epsilon=0.1, gamma=0.01, kernel=poly, score=-0.859, total=   0.7s

Organizing the best models:

best_SVM_models = pd.DataFrame(gsearch.cv_results_).nlargest(10, 'mean_test_score')
best_SVM_models
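
Extending this to the four algorithms, the selection strategy described above could be sketched as follows (the gsearch_* names are hypothetical fitted GridSearchCV objects, one per model):

import pandas as pd

def top_10(gsearch, model_name):
    # Keep the 10 best candidates of one grid search, tagged with the model name
    best = pd.DataFrame(gsearch.cv_results_).nlargest(10, 'mean_test_score')
    best['model'] = model_name
    return best

all_models = pd.concat([top_10(gsearch_mlp, 'MLP'),
                        top_10(gsearch_rfr, 'RFR'),
                        top_10(gsearch_svr, 'SVM'),
                        top_10(gsearch_xgb, 'XGB')])

# The overall winner is the row with the highest mean_test_score (R2)
best_overall = all_models.sort_values('mean_test_score', ascending=False).head(1)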

The overall metrics suggest that the best model is a Support Vector Machine regressor with the parameters:

{'C': 1000, 'degree': 2, 'epsilon': 0.1, 'gamma': 0.0001, 'kernel': 'rbf'}

So we retrain the SVM algorithm with those hyperparameters:

svr_rbf = SVR(C=1000, degree= 2, epsilon= 0.1, gamma= 0.0001, kernel='rbf')
svr_rbf.fit(X_train, y_train)
y_pred = svr_rbf.predict(X_test)
x = data_test.index[k:]

fig = plt.figure(dpi = 120, figsize = (12, 5))
plt.plot(x, y_test, ls = "--", label="Valor verdadero (pruebas)")
plt.plot(x, y_pred, ls = '-', label="Valor predicho (pruebas)")
plt.title("Predicción vs valores verdaderos (pruebas) - SVR")
plt.xticks(np.arange(0, 1000, 100))
plt.legend();

I also calculated the R2 for each model independently, so you can see the differences:

{'y_MLP': 0.9055978712322087,
 'y_RFR': 0.8495165657123944,
 'y_SVM': 0.9855505389592659,
 'y_XGB': 0.8799233166644396}
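
A sketch of how those per-model scores could be computed (the y_pred_* arrays below are hypothetical predictions from each retrained model on the test windows):

from sklearn.metrics import r2_score

# Hypothetical predictions of the four retrained models on X_test
predictions = {'y_MLP': y_pred_mlp, 'y_RFR': y_pred_rfr,
               'y_SVM': y_pred_svm, 'y_XGB': y_pred_xgb}

r2_by_model = {name: r2_score(y_test, y_pred) for name, y_pred in predictions.items()}
print(r2_by_model)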

It's clear that for this problem the best model is the Support Vector Machine regressor.

Conclusion

With this project and article series, I have implemented and shown an end-to-end Machine Learning Engineering problem from scratch, with real data.

In this third part, I implemented a Machine Learning experimentation plan to select the model that best fits the data and predicts the greenhouse gas emissions of the Colombian power system.