Volcanic activity detection and noise characterization using machine learning

Author(s)

  • Author1 = {“name”: “Myles Mason”, “affiliation”: “Virginia Tech”, “email”: “mylesm18@vt.edu”, “orcid”: “0000-0002-8811-8294”}

  • Author2 = {“name”: “John Wenskovitch”, “affiliation”: “Virginia Tech”, “email”: “jw87@vt.edu”, “orcid”: “0000-0002-0573-6442”}

  • Author3 = {“name”: “D. Sarah Stamps”, “affiliation”: “Virginia Tech”, “email”: “dstamps@vt.edu”,”orcid”: “0000-0002-3531-1752”}

  • Author4 = {“name”: “Joshua Robert Jones”, “affiliation”: “Virginia Tech”, “email”: “joshj55@vt.edu”, “orcid”: “0000-0002-6078-4287”}

  • Author5 = {“name”: “Mike Dye”, “affiliation”: “Unaffiliated”, “email”: “mike@mikedye.com”, “orcid”: ” 0000-0003-2065-870X”}

Purpose

This Jupyter notebook explores methods towards characterizing noise and eventually predicting volcanic activity for Ol Doinyo Lengai (an active volcano in Tanzania) with machine learning using the TZVOLCANO CHORDS portal. Machine learning is a powerful tool that enables the automatization of complex mathematical and analytical models. In this Jupyter notebook, the components are time, height, latitude, and longitude. The predicted component values are the following heights. This project uses Global Navigation Satellite System (GNSS) data from the EarthCube CHORDS portal TZVOLCANO (Stamps et al. 2016; Daniels et al., 2016; Kerkez et al., 2016), which is the online interface for obtaining open-access real-time positioning data collected around Ol Doinyo Lengai. The bulk of the project is the exploration of the data and later prediction of height points. The station that this project analyzes is OLO1 for days 12/16/2020 and 04/16/2021.

Technical contributions

  • The training of the models and analysis uses basic linear algebra and statistics

  • The main libraries used (NumPy and pandas) are both libraries for data manipulation and linear algebra

  • The TZVOLCANO CHORDS portal linked above is the location of the data

  • Implementation of Linear Regression for prediction on time-series data

Methodology

The desired data was imported and selected with a range of about five months. The user downloads data from the TZVOLCANO CHORDS portal as a geojson file (.JSON). Previously downloaded JSON’s are converted to a Dataframe for easy manipulation further in the project. The information is then visualized for better analysis and statistical metrics running. Sample data size is increased for the project; A second JSON is introduced at the beginning of the data processing and analysis, which will later be used for OLO1 to predict height data from OLO1 on 4/16/2021 using 12/16/2020. We set up four series objects from the original data frame to be used for inputs for machine learning algorithms. The data set up for a linear regression shows the difference in target vs. predicted data in scatter plot form. Finally, nine predictions are displayed graphically with adjusting the test size data to demonstrate the different day prediction positions.

Results

This notebook explored predicting height data from the TZVOLCANO CHORDS portal using Linear Regression from different days. It also evaluates how much test data is needed to best predict height data. We find that having 10% test data yields the best results for predictions with the Mean Squared Error of 8.325e-5% . For predictions from data inputted and predicted from a single day we find the 75% test data yields the best results with an average error of -1.074e-4 meters.

Funding

Keywords

keywords=[“tzDF”, “Linear Regression”, “Concat”, “Transpose”,”Mean Squared Error(MSE)”]

Citation

Mason, Myles, John Wenskovitch, D. Sarah Stamps, Joshua Robert Jones, Mike Dye (2021), EC_01_Volcanic_activity_detection_and_noise_characterization_using_machine learning, EarthCube Annual Meeting.

Suggested next steps

Noise activity in the data set is a topic that should be explored. Noise exploration is crucial to the development of this project because understanding the noise and being able to filter and characterize noise will allow any day to be analyzed efficiently and for more reliable predictions of hazardous volcanic deformation. Future next steps would include applying more classifying algorithms such as DBSCAN to characterize the noise.

Acknowledgements

  • Virginia Tech Department of Geosciences

  • Alice and Luther Hamlet

Setup

Library import

# Data manipulation
import pandas as pd
import json
import numpy as np
import math
from datetime import datetime as dt


# Visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Parameter definitions

  • tzDF: intial dataframe contains 12/06/2020 data

  • tz2DF: secondary dataframe contains 04/16/2021 data

  • ONE_THROUGH_TWENTY: array of values in tzDF[“measurements_height”] used for prediction

  • TWO_THROUGH_TWENTY_ONE:array of values in tzDF[“measurements_height”] used for prediction

  • THREE_THROUGH_TWENTY_TWO: array of values in tzDF[“measurements_height”] used for prediction

  • FOUR_THROUGH_TWENTY_THREE: array of values in tzDF[“measurements_height”] used for prediction

  • DECEMBER_SERIES_X: inputted height for 12/16/2020 data for Linear Regression

  • DECEMBER_SERIES_Y: target height data for 12/16/2020

  • APRIL_SERIES_X: inputted height data for 04/16/2021

  • APRIL_SERIES_Y: target height data for 04/16/2021

  • APRIL_PREDICTION’: Predicted values from the two days

Data import

The CHORDS notebook portal is where the data is acessed.

#Import for JSON files for manipulation
''' Both files are station one, but the first date is December 16,2020
    while the second date is April 16 2021.
'''
with open('OLO1_12_16_20.geojson', 'r', encoding="utf-8") as infile:
    tzList = json.load(infile)

with open('OLO1_4_16_21.geojson', 'r', encoding="utf-8") as infile:
    tz2List = json.load(infile)

Data processing and analysis

#Convert both JSON's into  a partially-flattened pandas DataFrame   
tzDF =  pd.json_normalize(tzList["features"][0]["properties"]["data"], sep='_')
tz2DF =  pd.json_normalize(tzList["features"][0]["properties"]["data"], sep='_')
#Overview of numerical elements of the data
print(tzDF.describe())
tzDF
       measurements_lat  measurements_height  measurements_lon
count      5.990000e+03          5990.000000      5.990000e+03
mean       2.734205e+00           988.153415      3.595022e+01
std        2.869056e-13             0.020116      1.547341e-07
min        2.734205e+00           988.095000      3.595022e+01
25%        2.734205e+00           988.139000      3.595022e+01
50%        2.734205e+00           988.152000      3.595022e+01
75%        2.734205e+00           988.167000      3.595022e+01
max        2.734205e+00           988.226000      3.595022e+01
time test measurements_lat measurements_height measurements_lon
0 2020-12-16T05:08:18Z false 2.734205 988.203 35.950217
1 2020-12-16T05:08:19Z false 2.734205 988.206 35.950217
2 2020-12-16T05:08:20Z false 2.734205 988.214 35.950217
3 2020-12-16T05:08:21Z false 2.734205 988.226 35.950217
4 2020-12-16T05:24:41Z false 2.734205 988.199 35.950217
... ... ... ... ... ...
5985 2020-12-17T04:48:44Z false 2.734205 988.139 35.950217
5986 2020-12-17T04:53:51Z false 2.734205 988.147 35.950217
5987 2020-12-17T04:53:52Z false 2.734205 988.132 35.950217
5988 2020-12-17T04:55:09Z false 2.734205 988.137 35.950217
5989 2020-12-17T04:59:54Z false 2.734205 988.147 35.950217

5990 rows × 5 columns

From the .describe call looking at our data frame, there does not seem to be much variation in columns measurements_lat and measurements_lon because the DataFrame is from a single day. The reason for choosing to explore the measurements_height column in this notebook is because of the hypothesis that the surface will uplift if there is magma reservoir inflation or the surface will subside if there is magma reservior deflation. From viewing the data frame, the time column is in a time series form with extra separator variables “T” and “Z” in the following two cells; we will convert the timestamp column into an integer form for easy manipulation.

Function to convert timestamp column of tzDF and tz2DF from time series to an integer for easy manipulation.

def timeconvertfunc(timestamp):
 #The format of tzDF["measurments_height] is in string form and timeseries so this method make it an integer"
  ts = pd.Timestamp(timestamp, tz=None).to_pydatetime()
  ts = 3600*ts.hour + 60*ts.minute + ts.second

  return ts
#Applying above method to the two data frames
tzDF["timeconvert"] = tzDF["time"].apply(timeconvertfunc)
tz2DF["timeconvert"] = tzDF["time"].apply(timeconvertfunc)

Visualization of basic statistics from measurements_height and linear regression

In the code block below the four series objects are partitions of the measurements_height column in the tzDF DataFrame. We create these partitions to feed into a linear regression model for predictions.

# Series from the first Data Frame

ONE_THROUGH_TWENTY = tzDF["measurements_height"].loc[1:21].values.reshape(-1,1)
TWO_THROUGH_TWENTY_ONE = tzDF["measurements_height"].loc[2:22].values.reshape(-1,1)
THREE_THROUGH_TWENTY_TWO =tzDF["measurements_height"].loc[3:23].values.reshape(-1,1)
FOUR_THROUGH_TWENTY_FOUR = tzDF["measurements_height"].loc[4:24].values.reshape(-1,1)

#Linear Regression model on columns 1-20 and 2-21

lm = LinearRegression()
lm.fit(ONE_THROUGH_TWENTY ,TWO_THROUGH_TWENTY_ONE)
y_pred = lm.predict(ONE_THROUGH_TWENTY)
plt.xlabel("actual height(m)")
plt.ylabel("predicted height(m)")
plt.title("actual height vs predicted height")
plt.scatter(ONE_THROUGH_TWENTY,TWO_THROUGH_TWENTY_ONE)
plt.plot(ONE_THROUGH_TWENTY,y_pred,color="red")
plt.show()

#Method to round for sigfigs
def to_precision(x,p):
    x = float(x)
    if x == 0.:
        return "0." + "0"*(p-1)
    out = []
    if x < 0:
        out.append("-")
        x = -x
    e = int(math.log10(x))
    tens = math.pow(10, e - p + 1)
    n = math.floor(x/tens)
    if n < math.pow(10, p - 1):
        e = e -1
        tens = math.pow(10, e - p+1)
        n = math.floor(x / tens)
    if abs((n + 1.) * tens - x) <= abs(n * tens -x):
        n = n + 1
    if n >= math.pow(10,p):
        n = n / 10.
        e = e + 1
    m = "%.*g" % (p, n)
    if e < -2 or e >= p:
        out.append(m[0])
        if p > 1:
            out.append(".")
            out.extend(m[1:p])
        out.append('e')
        if e > 0:
            out.append("+")
        out.append(str(e))
    elif e == (p -1):
        out.append(m)
    elif e >= 0:
        out.append(m[:e+1])
        if e+1 < len(m):
            out.append(".")
            out.extend(m[e+1:])
    else:
        out.append("0.")
        out.extend(["0"]*-(e+1))
        out.append(m)

    return "".join(out)
print("The b value is",to_precision(lm.coef_[0][0],4))

print("The m value is",to_precision(lm.intercept_[0],4))


print("The R^2 value is",to_precision(lm.score(ONE_THROUGH_TWENTY,TWO_THROUGH_TWENTY_ONE),4),)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_16_0.png
The b value is 0.8366
The m value is 161.5
The R^2 value is 0.7395

From the above model using the first through the twentieth column and second, through the twenty-second column, we yield a Coefficient of Correlation (R^2) value of about 0.74, which shows a positive correlation between the two inputs. So about 36% of the variation is residing in the residual.

Initiate Linear Regression

lm = LinearRegression()
lm.fit(TWO_THROUGH_TWENTY_ONE ,THREE_THROUGH_TWENTY_TWO)
y_pred1 = lm.predict(TWO_THROUGH_TWENTY_ONE)
plt.scatter(TWO_THROUGH_TWENTY_ONE,THREE_THROUGH_TWENTY_TWO)
plt.plot(TWO_THROUGH_TWENTY_ONE,y_pred1,color="green")
plt.xlabel("actual height(m)")
plt.ylabel("predicted height(m)")
plt.title("actual height vs predicted height")
plt.show()


print("This is the b value",to_precision(lm.coef_[0][0],4))
print("This is the m value",to_precision(lm.intercept_[0],4))
print("This is the R^2",to_precision(lm.score(TWO_THROUGH_TWENTY_ONE,THREE_THROUGH_TWENTY_TWO),4))
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_19_0.png
This is the b value 0.8102
This is the m value 187.5
This is the R^2 0.7257

From the above model using the first through the twentieth column and the second through the twenty-second column, we yield a Coefficient of Correlation (R^2) value of about 0.72, which shows a positive correlation between the two inputs. So about 38% of the variation is residing in the residual.

Linear Regression from single day

The following code chunk uses Linear Regression (specifically with rows of height measurement) for one_Through_Twenty,two_Through_Twenty_One, and three_Through_Twenty_Two. The data frame used for the model is tzDF, and we display the predicted values versus actual values.

#Set up series object for partions in the dataframe
ONE_THROUGH_TWENTY = tzDF["measurements_height"].loc[1:21]
TWO_THROUGH_TWENTY_ONE= tzDF["measurements_height"].loc[2:22]
THREE_THROUGH_TWENTY_TWO =tzDF["measurements_height"].loc[3:23]
FOUR_THROUGH_TWENTY_THREE = tzDF["measurements_height"].loc[4:24]

#Renaming the series objects
ONE_THROUGH_TWENTY.rename({"measurements_height":"w"},axis =1,inplace=True)
TWO_THROUGH_TWENTY_ONE.rename({"measurements_height":"x"},axis =1,inplace=True)
THREE_THROUGH_TWENTY_TWO.rename({"measurements_height":"y"},axis =1,inplace=True)
FOUR_THROUGH_TWENTY_THREE.rename({"measurements_height":"z"},axis =1,inplace=True)

#Concating the series objects to one dataframe, result_DF
result_DF = pd.concat([ONE_THROUGH_TWENTY,TWO_THROUGH_TWENTY_ONE,THREE_THROUGH_TWENTY_TWO,FOUR_THROUGH_TWENTY_THREE],axis=1)
result_DF

#Modefying dataframe by shifting the coulmns up 
result_DF.iloc[:,1] = result_DF.iloc[:,1].shift(-1)
result_DF.iloc[:,2] = result_DF.iloc[:,2].shift(-2)
result_DF.iloc[:,3] = result_DF.iloc[:,3].shift(-3)

#Aligning all coulmns of the data frame together
result_DF = result_DF.dropna()
result_DF = result_DF.transpose()
result_DF

#Linear Regression on all rows and 1-19 coulmns 
lm = LinearRegression()
x = result_DF.iloc[:,0:20]
y = result_DF.iloc[:,20]
lm.fit(x,y)
y_pred1 = lm.predict(x)
plt.scatter(y,y_pred1)
plt.xlabel("actual height(m)")
plt.ylabel("predicted height(m)")
plt.title("actual height vs predicted height")
plt.show()
#REMOVE CODE
#print("This is the b value",lm.coef_,)
#print("This is the m value",lm.intercept_)
#print("This is the R^2",lm.score(x,y))
print(y-y_pred1)
type(FOUR_THROUGH_TWENTY_THREE)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_23_0.png
measurements_height    0.0
measurements_height    0.0
measurements_height    0.0
measurements_height    0.0
Name: 21, dtype: float64
pandas.core.series.Series

From the above graph, all points are predicted as the values overlap graphically with a b value of 0.0194. The height difference is zero. We will now increase the sample size of the data.

Method to increase sample size for the model

The method below takes in a data frame, goes into measurement height, and gets the first through nineteenth values in the height data frame. The first and nineteenth values are increased by one thousand times. The Data frame is transposed into columns then added to an empty list to be concatenated into a data frame.

#Making an empty list to store coulmn values
empty_list = []
def make_list(DataFrame):
    i = 0
    for i in range(1000):  
        change = tzDF["measurements_height"].iloc[i+1:i+21].to_frame().transpose()
        #Names of the columns
        change.columns = ["history_1","history_2","history_3","history_4","history_5",
                          "history_6","history_7","history_8","history_9","history_10",
                          "history_11","history_12","history_13","history_14","history_15",
                          "history_16","history_17","history_18","history_19","history_20"]
        change.index = [i]
        empty_list.append(change)
    return empty_list
# List of all columns values 
tzDF_list = make_list(tzDF["measurements_height"])
tzDF_two_list = make_list(tz2DF["measurements_height"])

#List iteration to combine all elements in list
finalDF = pd.concat([m for m in tzDF_list])
finalDF_two =  pd.concat([m for m in tzDF_two_list])
finalDF_two
history_1 history_2 history_3 history_4 history_5 history_6 history_7 history_8 history_9 history_10 history_11 history_12 history_13 history_14 history_15 history_16 history_17 history_18 history_19 history_20
0 988.206 988.214 988.226 988.199 988.220 988.212 988.202 988.167 988.167 988.126 988.129 988.132 988.151 988.135 988.108 988.114 988.129 988.169 988.162 988.147
1 988.214 988.226 988.199 988.220 988.212 988.202 988.167 988.167 988.126 988.129 988.132 988.151 988.135 988.108 988.114 988.129 988.169 988.162 988.147 988.143
2 988.226 988.199 988.220 988.212 988.202 988.167 988.167 988.126 988.129 988.132 988.151 988.135 988.108 988.114 988.129 988.169 988.162 988.147 988.143 988.146
3 988.199 988.220 988.212 988.202 988.167 988.167 988.126 988.129 988.132 988.151 988.135 988.108 988.114 988.129 988.169 988.162 988.147 988.143 988.146 988.142
4 988.220 988.212 988.202 988.167 988.167 988.126 988.129 988.132 988.151 988.135 988.108 988.114 988.129 988.169 988.162 988.147 988.143 988.146 988.142 988.136
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 988.100 988.106 988.131 988.121 988.129 988.129 988.146 988.148 988.131 988.130 988.145 988.112 988.127 988.140 988.114 988.122 988.151 988.164 988.140 988.143
996 988.106 988.131 988.121 988.129 988.129 988.146 988.148 988.131 988.130 988.145 988.112 988.127 988.140 988.114 988.122 988.151 988.164 988.140 988.143 988.146
997 988.131 988.121 988.129 988.129 988.146 988.148 988.131 988.130 988.145 988.112 988.127 988.140 988.114 988.122 988.151 988.164 988.140 988.143 988.146 988.132
998 988.121 988.129 988.129 988.146 988.148 988.131 988.130 988.145 988.112 988.127 988.140 988.114 988.122 988.151 988.164 988.140 988.143 988.146 988.132 988.133
999 988.129 988.129 988.146 988.148 988.131 988.130 988.145 988.112 988.127 988.140 988.114 988.122 988.151 988.164 988.140 988.143 988.146 988.132 988.133 988.117

2000 rows × 20 columns

Increased data points for linear regression

Below we will take the freshly made data frame finalDF with 200 rows x 20 columns, put the data into a linear regression utilizing the train test split module from the sci-kit library. The test sizes of 35, 55, and 75 are used for variability. The x is all of the rows in the new data frame and columns 1-19, while the or output is all of the rows and the 19th column that we are predicting.

35% Test Data demonstration

def make_prediction(DataFrame,number): 
    input_Data = finalDF.iloc[:,0:19]
    target_Data = finalDF.iloc[:,19]
    lm = LinearRegression()
    lm.fit(input_Data,target_Data)
    y_pred1 = lm.predict(input_Data)
    plt.scatter(target_Data,y_pred1)
    plt.xlabel("target height(m)")
    plt.ylabel("predicted height(m)")
    plt.title("Target vs Predicted")
    plt.scatter
    # x axis is actual height and y is what lm model is predicting in scatter
    X_train, X_test, y_train, y_test = train_test_split(input_Data, target_Data,test_size = number, random_state=50)
    model = lm.fit(X_train,y_train)
    prediction = lm.predict(X_test)
    #Series of the difference of the test the leng
    error_Series = y_test-prediction
    #average
    average_Difference = sum(error_Series/len(error_Series))
    #print(average_Difference)
    print(to_precision(average_Difference,4))

make_prediction(tzDF,0.35)
-2.296e-4
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_31_1.png

When y_test-y_predicition, we get an average difference of -0.0003 from the model’s actual and predicted values in the above cell. From the outliers in the plot, we can view the noise graphically.

55% Test Data demonstration

#Function Call for 55% test data dem
make_prediction(tzDF,0.55)
-3.078e-4
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_34_1.png

In the above cell, when y_test-y_predicition, we get an average difference of -0.00030778063659289 from the model’s actual and predicted values.

75% Test Data demonstration

#Function Call for 75% test data dem
make_prediction(tzDF,0.75)
-1.074e-4
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_37_1.png

In the above cell when y_test-y_predicition we get an average difference of -0.0001 from the model’s actual and predicted values.

Using one day’s data to predict a different day’s data

The demonstrations below utilize the second data frame now named finalDF_two. The December 16, 2020 dates data will be trained to predict the April 16, 2020 dates data. The test size increases by ten percent in the range of 20-90 for variability. We will be using the Mean Squared value that shows the error for linear regression to view the model’s accuracy.

Prediction from 10% test size data

def different_Day_Prediction(DataFrame,number):
    DECEMBER_SERIES_X  = finalDF.iloc[:,0:19]
    DECEMBER_SERIES_Y = finalDF.iloc[:,19]
    APRIL_SERIES_X = finalDF_two.iloc[:,0:19]
    APRIL_SERIES_Y= finalDF_two.iloc[:,19]
    lm.fit(DECEMBER_SERIES_X,DECEMBER_SERIES_Y)
    y_pred1 = lm.predict(DECEMBER_SERIES_X)
    #train_test_split_module
    X_train, X_test, y_train, y_test = train_test_split(DECEMBER_SERIES_X, APRIL_SERIES_Y, test_size=number,random_state=50)
    model = lm.fit(X_train,y_train)
    model.intercept_
    model.coef_
    #setting the predicition variable 
    APRIL_PREDICTION = model.predict(APRIL_SERIES_X)
    #50 bins were picked for all of the models below
    #Distrubution of Errors pred vs actual
    plt.hist(APRIL_PREDICTION-APRIL_SERIES_Y,bins = 50,color ="black")
    plt.xlabel("meters(m)")
    plt.ylabel("distribution")
    plt.title("Distrbution of predicted minus actual")
    plt.show()
    #Displaying Mean Squared Error
    twenty_MSE = mean_squared_error(APRIL_SERIES_Y,APRIL_PREDICTION)
    print("The mean squared error for this model is",to_precision(twenty_MSE,4),"%.")

different_Day_Prediction(tzDF,.1)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_42_0.png
The mean squared error for this model is 8.325e-5 %.

Prediction from 20% test size data

In the cell below, december_Series_X and decemeber_Series_Y will be the data that is used to train the model. April_Series_X and april_Series_Y will be the different days and data that we predict it points from. We will display a histogram to view the error distribution between the actual and predicted data as well as

different_Day_Prediction(tzDF,.2)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_45_0.png
The mean squared error for this model is 8.344e-5 %.

In the above cell, the histogram displayed displays the error from the prediction minus the series. Most of the distribution of the data is centered around zero, indicating the performance of the model. The MSE is extremely low, showing the accuracy of the model.

Prediction from 30% test size data

In the cell below our we will be inputting in 30% test data.

different_Day_Prediction(tzDF,.3)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_49_0.png
The mean squared error for this model is 8.354e-5 %.

Prediction from 40% test size data

different_Day_Prediction(tzDF,.4)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_51_0.png
The mean squared error for this model is 8.383e-5 %.

Prediction from 50% test size data

different_Day_Prediction(tzDF,.5)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_53_0.png
The mean squared error for this model is 8.383e-5 %.

Prediction from 60% test size data

different_Day_Prediction(tzDF,.6)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_55_0.png
The mean squared error for this model is 8.487e-5 %.

Prediction from 70% test size data

different_Day_Prediction(tzDF,.7)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_57_0.png
The mean squared error for this model is 8.563e-5 %.

Prediction from 80% test size data

different_Day_Prediction(tzDF,.8)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_59_0.png
The mean squared error for this model is 8.615e-5 %.

Prediction from 90% test size data

different_Day_Prediction(tzDF,.9)
../../_images/MM_01_Volcanic_activity_detection_and_noise_characterization_using_machine_learning_61_0.png
The mean squared error for this model is 8.956e-5 %.

The mean square error for the above models is deficient, showing the success of inputting in height data for one day and predicting height data from another day.

Results

With this notebook we explored the use of machine learning for predictive analytics applied to vertical surface motions at an active volcano in Tanzania, Ol Doinyo Lengai. When training a model on one day and using that model to predict height values from another day, our lowest error was 8.325e-5 %. The future implications for this project and data are important because Ol Doinyo Lengai future eruption could cause distress for not only Tanzania, but other countries surrounding Tanzania. The ability to predict volcanic activity will be a valuable contribution to volcanic hazards assessment.

References

Daniels, Mike, Branko Kerkez, V. Chandrasekar, Sara Graves, D. Sarah Stamps, Charles Martin, Aaron Botnick, Michael Dye, Ryan Gooch, Josh Jones, Ken Keiser, Matthew Bartos, Thaovy Nguyen, Robyn Collins, Sophia Chen, Terrie Yang, Abbi Devins-Suresh (2016). Cloud-Hosted Real-time Data Services for the Geosciences (CHORDS) software (Version 1.0.1). UCAR/NCAR - EarthCube. https://doi.org/10.5065/d6v1236q

Kerkez, Branko, Michael Daniels, Sara Graves, V. Chandrasekar, Ken Keiser, Charlie Martin, Michael Dye, Manil Maskey, and Frank Vernon. “Cloud Hosted Real‐time Data Services for the Geosciences (CHORDS).” (2016), doi: 10.1002/gdj3.36.

GeÌ ron, Aureì lien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2nd ed. CA 95472: O’Reilly.

Stamps, D. S., Saria, E., Ji, K. H., Jones, J. R., Ntambila, D., Daniels, M. D., & Mencin, D. (2016). Real-time data from the Tanzania Volcano Observatory at the Ol Doinyo Lengai volcano in Tanzania (TZVOLCANO). UCAR/NCAR - EarthCube. https://doi.org/10.5065/D6P849BM