Link Prediction

In [1]:
import pandas
import seaborn
import numpy
import scipy.spatial  # needed for scipy.spatial.distance.cosine below
import matplotlib.pyplot as plt
import random


from sklearn import linear_model, metrics, model_selection, svm, preprocessing, neural_network, pipeline

pandas.set_option("display.max_columns", 101)
pandas.set_option("display.float_format", "{:.2f}".format)
In [2]:
data = pandas.read_csv("data/p1p2Corona.csv")

1. About the data

We first check how many rows and columns we have.

In [3]:
print(f"Rows:\t\t{data.shape[0]}\nColumns:\t{data.shape[1]}")
Rows:		166329
Columns:	48

What about data types?

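The cell answering this is missing from the export; a minimal check (a sketch) would be:

data.dtypes.value_counts()
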
Basic stats

In [4]:
data.describe()
Out[4]:
rweight ID(p1) p1averagedocyear p1.`betweennesscentrality` p1.`pagerank` p1.`COAUTHORcommunitylouvainall` p1.`authormsaid` p1.`estimatedtotalcitations` p1.`totalcitations` p1.`positioninauthorlist` p1.`idlastknownaffiliation` p1.`authoraffiliationid` p1.`defaultid` p1.`totalpublications` p1.`noorgorcoauthorperson` p1.`suspicious` ID(p2) p2averagedocyear p2.`betweennesscentrality` p2.`pagerank` p2.`COAUTHORcommunitylouvainall` p2.`authormsaid` p2.`estimatedtotalcitations` p2.`totalcitations` p2.`positioninauthorlist` p2.`idlastknownaffiliation` p2.`authoraffiliationid` p2.`defaultid` p2.`totalpublications` p2.`noorgorcoauthorperson` p2.`suspicious` adamicAdarscore commonNeighborsscore preferentialAttachmentscore resourceAllocationscore totalNeighborsscore scorecommunitylouvainall topicsimilarity
count 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 133028.00 93761.00 166329.00 166329.00 0.00 0.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 126239.00 91652.00 166329.00 166329.00 0.00 0.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00 166329.00
mean 1.25 9602.68 2010.12 212452.92 2.21 12449.13 2313166345.15 4349.05 3141.80 7.14 528778051.07 523430111.36 2313166345.15 78.16 nan nan 13332.01 2010.14 138274.75 1.61 12546.78 2379175699.22 3481.15 2487.74 10.26 527309872.54 515541346.48 2379175699.22 63.36 nan nan 4.98 16.64 2412.12 0.68 56.00 0.94 0.53
std 0.99 8428.09 7.35 937849.26 3.44 7980.30 561476884.80 11747.42 8418.45 7.71 753899480.52 738662251.16 561476884.80 190.34 nan nan 9068.08 7.55 774374.37 2.72 7951.81 536565982.55 11310.81 7771.72 9.86 750631649.99 729359661.77 536565982.55 168.64 nan nan 3.41 15.00 8923.39 0.31 79.89 0.24 0.39
min 1.00 11.00 1970.00 0.00 0.18 298.00 525113.00 0.00 0.00 1.00 46017.00 52325.00 525113.00 1.00 nan nan 12.00 1970.00 0.00 0.17 298.00 525113.00 0.00 0.00 1.00 46017.00 52325.00 525113.00 1.00 nan nan 0.00 0.00 1.00 0.00 2.00 0.00 0.00
25% 1.00 1930.00 2005.00 0.00 0.93 5502.00 2113892282.00 36.00 35.00 2.00 74801974.00 72212903.00 2113892282.00 3.00 nan nan 6484.00 2005.00 0.00 0.75 5502.00 2129270150.00 19.00 19.00 4.00 72212903.00 66837359.50 2129270150.00 2.00 nan nan 2.79 7.00 121.00 0.50 13.00 1.00 0.16
50% 1.00 7370.00 2010.29 0.50 1.06 12495.00 2306139023.00 438.00 343.00 5.00 157773358.00 157536573.00 2306139023.00 17.00 nan nan 12777.00 2010.08 0.00 0.97 12644.00 2478031315.00 212.00 173.00 7.00 157773358.00 157536573.00 2478031315.00 9.00 nan nan 4.00 12.00 440.00 0.70 26.00 1.00 0.44
75% 1.00 15504.00 2016.00 28928.00 1.94 18517.00 2671671651.00 3291.00 2405.00 9.00 889458895.00 889458895.00 2671671651.00 78.00 nan nan 19191.00 2017.00 1453.87 1.24 18517.00 2716000858.00 1952.00 1477.00 13.00 912377674.00 901861585.00 2716000858.00 57.00 nan nan 6.11 20.00 1800.00 0.85 65.00 1.00 1.00
max 56.00 61669.00 2020.00 13119691.78 31.65 29384.00 3010639452.00 375086.00 258534.00 67.00 3004594783.00 2898336195.00 3010639452.00 5332.00 nan nan 61670.00 2020.00 13119691.78 31.65 29384.00 3010639452.00 375086.00 258534.00 67.00 3004594783.00 2898336195.00 3010639452.00 5332.00 nan nan 83.47 257.00 649368.00 13.00 936.00 1.00 1.00

2. Data Cleanup

We start by cleaning the column names.

We will then drop:

  • Columns with "duplicate" information.
  • Columns containing only NaN.
In [5]:
def clean_column_names(dataframe):
    # Strip backticks and dots from column names; regex=False keeps "." literal.
    dataframe.columns = dataframe.columns.str.replace("`", "", regex=False).str.replace(".", "", regex=False)
    return dataframe

data = clean_column_names(data)
In [6]:
def clean_empty_cols(dataframe):
    # Known duplicate/redundant columns; all-NaN columns are added below.
    cols_to_drop = ["p1originalaffiliationname", "p1originalauthorname", "p2originalaffiliationname", "p2originalauthorname", "ID(p1)", "ID(p2)"]

    for column_name in list(dataframe):
        if dataframe[column_name].isnull().all():
            print(f"'{column_name}' is empty and will be dropped.")
            cols_to_drop.append(column_name)

    for col in cols_to_drop:
        if col in list(dataframe):
            dataframe = dataframe.drop(col, axis=1)

    return dataframe

    
data = clean_empty_cols(data)
'p1noorgorcoauthorperson' is empty and will be dropped.
'p1suspicious' is empty and will be dropped.
'p2noorgorcoauthorperson' is empty and will be dropped.
'p2suspicious' is empty and will be dropped.

Let's encode the string columns:

In [7]:
def encode_dataframe(dataframe):
    encoding_dict = dict()
    cols_to_encode = ["p1authoraffiliationname",
                      "p1normauthorname",
                      "p1lastknownaffiliation",
                      "p2authoraffiliationname",
                      "p2normauthorname",
                      "p2lastknownaffiliation"]
    for column in cols_to_encode:
        dataframe[column] = dataframe[column].fillna("unknown")

        # Fit once and transform; keep the fitted encoder so we can decode later.
        le = preprocessing.LabelEncoder()
        dataframe[column] = le.fit_transform(dataframe[column])
        encoding_dict[column] = le
    return dataframe, encoding_dict, cols_to_encode

def decode_dataframe(dataframe, encoding_dict, encoded_cols):
    for encoded_col in encoded_cols:
        dataframe[encoded_col] = encoding_dict[encoded_col].inverse_transform(dataframe[encoded_col])
    return dataframe
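As a quick illustration of the encode/decode round trip (a toy sketch, not part of the original run):

le = preprocessing.LabelEncoder()
codes = le.fit_transform(["csic", "uam", "csic"])  # -> array([0, 1, 0])
le.inverse_transform(codes)                        # -> array(['csic', 'uam', 'csic'])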
In [8]:
data, data_encoder_dict, data_encoded_cols = encode_dataframe(data)

All ID columns should be ints:

In [9]:
def make_cols_int(dataframe):
    cols = ["p1idlastknownaffiliation",
            "p1authoraffiliationid",
            "p2idlastknownaffiliation",
            "p2authoraffiliationid"]
    for col in cols:
        dataframe[col] = dataframe[col].fillna(0).astype(int)
        
    return dataframe
    
data = make_cols_int(data)

Let's see what we have in the end:

In [10]:
data.sample(10)
Out[10]:
rweight p1averagedocyear p1betweennesscentrality p1pagerank p1COAUTHORcommunitylouvainall p1authoraffiliationname p1authormsaid p1estimatedtotalcitations p1normauthorname p1lastknownaffiliation p1totalcitations p1positioninauthorlist p1idlastknownaffiliation p1authoraffiliationid p1defaultid p1totalpublications p2averagedocyear p2betweennesscentrality p2pagerank p2COAUTHORcommunitylouvainall p2authoraffiliationname p2authormsaid p2estimatedtotalcitations p2normauthorname p2lastknownaffiliation p2totalcitations p2positioninauthorlist p2idlastknownaffiliation p2authoraffiliationid p2defaultid p2totalpublications adamicAdarscore commonNeighborsscore preferentialAttachmentscore resourceAllocationscore totalNeighborsscore scorecommunitylouvainall topicsimilarity
148896 1 1987.23 483436.13 4.78 11372 719 2340576187 7 3396 1064 7 4 9364636 9364636 2340576187 2 1992.00 0.00 0.90 11372 729 2559965316 24 14751 1079 24 4 9364636 9364636 2559965316 1 2.84 6 371 0.73 36 1 0.07
120014 1 2017.00 0.00 0.74 1521 296 2676123000 60 2928 10 60 2 2802541053 72212903 2676123000 5 2008.50 511665.81 3.23 1521 284 2165887540 10207 14257 433 7272 20 92039509 92039509 2165887540 491 5.16 18 1843 0.59 73 1 0.16
103785 1 2003.00 0.00 0.94 10166 1167 2937642937 51 21050 1790 32 2 0 0 2937642937 1 2003.00 0.00 0.94 10166 1211 2620586433 51 12202 1830 32 9 0 0 2620586433 1 6.20 19 400 0.90 21 1 1.00
107166 1 2004.00 37.50 1.12 18097 258 2256776460 109 7995 395 67 6 66068411 66068411 2256776460 2 2004.00 0.00 0.58 18097 264 2097158061 776 10148 402 619 7 66068411 66068411 2097158061 7 2.89 9 170 0.42 18 1 0.52
8123 1 2005.40 82689.49 2.63 10022 1147 1951662471 5615 14985 303 4049 59 186335123 212119943 1951662471 49 2010.67 24231.39 1.86 10022 1177 1991219967 5096 8238 1671 3966 9 185261750 185261750 1991219967 57 6.64 25 3880 0.62 102 1 0.15
96342 1 2007.33 269.11 0.82 16592 1167 2643370531 144 20752 1261 118 6 99065089 0 2643370531 7 2004.00 49785.04 1.51 17829 1211 2155915228 11917 15949 227 8514 7 19820366 0 2155915228 249 2.74 9 414 0.38 25 0 0.19
1484 1 2006.13 2888982.17 17.45 22917 1006 2109064762 61717 20252 1449 42698 4 889458895 889458895 2109064762 420 2005.00 0.00 0.45 22917 1211 2473499176 1559 3300 1490 1012 1 889458895 0 2473499176 7 2.50 9 4628 0.35 349 1 0.02
16285 1 2013.38 118793.72 3.05 4996 302 1214858697 13038 15270 1000 9428 11 348769827 1338334475 1214858697 353 2013.00 0.00 1.18 4996 1211 2678577151 107 14402 1830 71 55 0 0 2678577151 1 14.39 61 6464 0.89 97 1 0.13
123555 1 2004.00 0.00 0.59 22917 1167 2648892066 74 14445 1790 49 13 0 0 2648892066 1 2009.12 8434922.97 30.31 22917 1040 2903265382 81 10803 1708 54 6 25974101 889458895 2903265382 15 4.06 15 19836 0.46 532 1 0.01
165747 1 2003.00 0.00 0.96 29075 1167 2398126245 0 19748 1790 0 9 0 0 2398126245 1 2003.00 0.00 0.96 29075 1211 2696253038 0 22711 1830 0 12 0 0 2696253038 3 5.21 16 361 0.75 18 1 1.00

3. Create a cosine similarity column

We start with a function that computes the cosine similarity between the embedding vectors of two node ids.

In [11]:
def calculate_cos_sim(p1_id, p2_id, dataframe):
    # Look up each node's embedding vector by id.
    a = dataframe.loc[[p1_id]].to_numpy().flatten()
    b = dataframe.loc[[p2_id]].to_numpy().flatten()
    # scipy returns the cosine *distance*; similarity is its complement.
    cos_distance = scipy.spatial.distance.cosine(a, b)
    cos_similarity = 1 - cos_distance
    return cos_similarity
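A quick sanity check on toy vectors (an illustrative sketch; the ids and values are made up):

toy = pandas.DataFrame([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], index=[10, 11, 12])
print(calculate_cos_sim(10, 11, toy))  # identical vectors -> 1.0
print(calculate_cos_sim(10, 12, toy))  # orthogonal vectors -> 0.0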

We import the embedding data, in this case with 128 dimensions.

In [12]:
embedding_data = pandas.read_csv("data/corona_128.emd", sep=" ", header=None, skiprows=1, index_col=0)

We build the column row by row:

In [13]:
cosine_similarity_column = []
for index, row in data.iterrows():
    p1_id = int(row["p1defaultid"])
    p2_id = int(row["p2defaultid"])
    cosine_similarity_column.append(calculate_cos_sim(p1_id, p2_id, embedding_data))
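Iterating with iterrows is slow on 166k rows; an equivalent vectorized computation (a sketch, assuming every id in data is present in the embedding index) could look like this:

a = embedding_data.loc[data["p1defaultid"].astype(int)].to_numpy()
b = embedding_data.loc[data["p2defaultid"].astype(int)].to_numpy()
# Row-wise cosine similarity over all pairs at once.
cos_sim = (a * b).sum(axis=1) / (numpy.linalg.norm(a, axis=1) * numpy.linalg.norm(b, axis=1))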

And we assign it to our original dataframe:

In [14]:
data["cosine_similarity"] = cosine_similarity_column

4. Basic Stats

What is the distribution of our variable?

In [15]:
seaborn.set_style("white")
plt.figure(figsize=(8, 5))
plt.title("Distribution of 'rweight' variable")
seaborn.distplot(data.rweight, hist=True, bins=10)
plt.xlabel("Rweight value")
plt.ylabel("Frequency")
#plt.xlim(0, 20)
plt.show()

Which variables are most correlated with rweight?

In [16]:
threshold = 0.1
for label, correlation_score in data.corr().rweight.sort_values(ascending=False).items():
    if abs(correlation_score) >= threshold and label != "rweight":
        print(f"{label} with {correlation_score:.2f} correlation score.")
resourceAllocationscore with 0.50 correlation score.
preferentialAttachmentscore with 0.32 correlation score.
adamicAdarscore with 0.29 correlation score.
p2pagerank with 0.22 correlation score.
commonNeighborsscore with 0.21 correlation score.
totalNeighborsscore with 0.18 correlation score.
p1pagerank with 0.17 correlation score.
p2betweennesscentrality with 0.14 correlation score.
p1betweennesscentrality with 0.10 correlation score.
cosine_similarity with -0.11 correlation score.
topicsimilarity with -0.20 correlation score.
In [17]:
# Takes too long; we could consider something like sampling 10K rows:
# plt.figure(figsize=(20,20))
# seaborn.clustermap(data.corr(), cmap=plt.cm.Reds) OR seaborn.clustermap(data.sample(10000).corr(), cmap=plt.cm.Reds)
# plt.show()
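A runnable version of that sampling idea (a sketch; the sample size and random_state are assumptions):

sampled_corr = data.sample(10000, random_state=0).corr()
seaborn.clustermap(sampled_corr, cmap=plt.cm.Reds, figsize=(20, 20))
plt.show()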

Correlation heatmap for variables (absolute correlation):

In [18]:
plt.figure(figsize=(15,15))
seaborn.heatmap(abs(data.corr()), 
                cmap=plt.cm.Purples, 
                robust=False, 
                linewidths=0.1, 
                linecolor="white",
                cbar_kws={'label': 'Correlation Value'})
plt.title("Correlation heatmap between dataset variables")
plt.show()

5. Create test and train set

We start by defining some variables:

In [19]:
to_predict = "rweight" 
test_size = 0.33
shuffle = True
cross_validation_folds = 5
n_values_to_plot = 500

And we create our X and y:

In [20]:
# ONLY HIGHLY CORRELATED COLUMNS -> RESULTS ARE SIMILAR
# corr_thres = 0.2
# corr_columns = [column_name 
#                 for column_name in list(data) 
#                 if abs(data.rweight.corr(data[column_name])) > corr_thres 
#                 and column_name != "rweight"]
# X = data.loc[:, corr_columns]

X = data.loc[:, data.columns != to_predict]
y = data[to_predict]

As well as our train and test splits:

In [21]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, 
                                                    y, 
                                                    test_size=test_size,  
                                                    shuffle=shuffle)

print(f"X_train: {X_train.shape}")
print(f"y_train: {y_train.shape}")
print()
print(f"X_test: {X_test.shape}")
print(f"y_test: {y_test.shape}")
X_train: (111440, 38)
y_train: (111440,)

X_test: (54889, 38)
y_test: (54889,)
6. Linear Regression

In [22]:
model = "linear regression"

We fit a linear regression:

In [23]:
lr = linear_model.LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {metrics.mean_squared_error(y_test, y_pred, squared=False):.4f}")
print(f"R^2: {metrics.r2_score(y_test, y_pred):.4f}")
MAE: 0.4000
MSE: 0.5144
RMSE: 0.7172
R^2: 0.4409
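To put these numbers in context, a trivial mean-predicting baseline (a sketch, not part of the original run) would score an R^2 near zero:

from sklearn import dummy

# Baseline that always predicts the training mean of rweight.
baseline = dummy.DummyRegressor(strategy="mean").fit(X_train, y_train)
print(f"Baseline R^2: {metrics.r2_score(y_test, baseline.predict(X_test)):.4f}")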

Visualize predictions:

In [24]:
indexes = random.sample(range(len(y_test)), k=n_values_to_plot)
y_test_values = y_test.to_list()
plt.figure(figsize=(12,4))
plt.title(f"{model} prediction on {n_values_to_plot} samples")
x_ax = range(n_values_to_plot)
plt.plot(x_ax, [y_test_values[i] for i in indexes], color='black', label="Test", alpha=0.8)
plt.plot(x_ax, [y_pred[i] for i in indexes], ".", color='red', label="Prediction")
plt.xlabel("Test Samples")
plt.ylabel("$Rweight$ value")
plt.xlim(0, n_values_to_plot)
plt.legend()
Out[24]:
<matplotlib.legend.Legend at 0x12aa1aad0>

What are the most important features?

In [25]:
plt.figure(figsize=(10,5))
plt.title(f"Attribute importance with {model}")
coefs = dict(zip(list(X_train), lr.coef_))
plt.bar(range(len(coefs)), list(coefs.values()), align='center')
plt.xticks(range(len(coefs)), list(coefs.keys()), rotation=90)
plt.ylabel("Coefficient Importance")
plt.show()
7. Elastic Net

In [26]:
model = "Elastic Net"

We fit an Elastic Net model.

Here are some alpha values we will use:

In [27]:
alphas = [0.0001, 0.001, 0.01, 0.1, 0.8]
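ElasticNetCV can also cross-validate the L1/L2 mixing parameter l1_ratio (the fit below keeps the default of 0.5); an optional variant (a sketch, with an illustrative grid) would be:

elastic_l1 = linear_model.ElasticNetCV(alphas=alphas, l1_ratio=[0.1, 0.5, 0.9],
                                       cv=cross_validation_folds, tol=0.5)
# elastic_l1.fit(X_train, y_train) would then select both alpha_ and l1_ratio_.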
In [28]:
elastic = linear_model.ElasticNetCV(alphas=alphas, cv=cross_validation_folds, tol=0.5).fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print(f"Selected alpha from cross validation: {elastic.alpha_}")
print()
print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {metrics.mean_squared_error(y_test, y_pred, squared=False):.4f}")
print(f"R^2: {metrics.r2_score(y_test, y_pred):.4f}")
Selected alpha from cross validation: 0.0001

MAE: 0.4081
MSE: 0.5518
RMSE: 0.7428
R^2: 0.4003

Visualize predictions:

In [29]:
indexes = random.sample(range(len(y_test)), k=n_values_to_plot)
y_test_values = y_test.to_list()
plt.figure(figsize=(12,4))
plt.title(f"{model} prediction on {n_values_to_plot} samples")
x_ax = range(n_values_to_plot)
plt.plot(x_ax, [y_test_values[i] for i in indexes], color='black', label="Test", alpha=0.8)
plt.plot(x_ax, [y_pred[i] for i in indexes], ".", color='red', label="Prediction")
plt.xlabel("Test Samples")
plt.ylabel("$Rweight$ value")
plt.xlim(0, n_values_to_plot)
plt.legend()
Out[29]:
<matplotlib.legend.Legend at 0x12a328950>

What are the most important features?

In [30]:
plt.figure(figsize=(10,5))
plt.title(f"Attribute importance with {model}")
coefs = dict(zip(list(X_train), elastic.coef_))
plt.bar(range(len(coefs)), list(coefs.values()), align='center')
plt.xticks(range(len(coefs)), list(coefs.keys()), rotation=90)
plt.ylabel("Coefficient Importance")
plt.show()
8. Neural Network Regression

In [31]:
model = "MLP Regressor"

We fit a neural network:

In [32]:
mlp = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor(hidden_layer_sizes=(100, 100),
                                tol=1e-2, max_iter=500, random_state=0),
).fit(X_train, y_train)
y_pred = mlp.predict(X_test)


print(f"MAE: {metrics.mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {metrics.mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {metrics.mean_squared_error(y_test, y_pred, squared=False):.4f}")
print(f"R^2: {metrics.r2_score(y_test, y_pred):.4f}")
MAE: 0.2411
MSE: 0.3832
RMSE: 0.6190
R^2: 0.5836

Visualize predictions:

In [33]:
indexes = random.sample(range(len(y_test)), k=n_values_to_plot)
y_test_values = y_test.to_list()
plt.figure(figsize=(12,4))
plt.title(f"{model} prediction on {n_values_to_plot} samples")
x_ax = range(n_values_to_plot)
plt.plot(x_ax, [y_test_values[i] for i in indexes], color='black', label="Test", alpha=0.8)
plt.plot(x_ax, [y_pred[i] for i in indexes], ".", color='red', label="Prediction")
plt.xlabel("Test Samples")
plt.ylabel("$Rweight$ value")
plt.xlim(0, n_values_to_plot)
plt.legend()
Out[33]:
<matplotlib.legend.Legend at 0x12aa24810>

Unlike the linear models, the MLP exposes no simple coefficients; we just inspect the fitted estimator:

In [34]:
mlp.steps[1][1]
Out[34]:
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100, 100), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=500,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=0, shuffle=True, solver='adam', tol=0.01,
             validation_fraction=0.1, verbose=False, warm_start=False)
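One model-agnostic way to get feature importances here would be permutation importance (a sketch, not part of the original run; the subsample size and n_repeats are assumptions made to keep it fast):

from sklearn.inspection import permutation_importance

# Score drop when each feature is shuffled, on a 5k-row test subsample.
rng = numpy.random.RandomState(0)
subset = rng.choice(len(X_test), size=5000, replace=False)
result = permutation_importance(mlp, X_test.iloc[subset], y_test.iloc[subset],
                                n_repeats=5, random_state=0)
pandas.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False).head(10)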

9. Validation

We start by loading the data:

In [35]:
v_data = pandas.read_csv("data/CSVforMLenjuanesComplete2.csv")

We then clean the data, using the same functions as before.

In [36]:
v_data = clean_column_names(v_data)
v_data = clean_empty_cols(v_data)
v_data, v_data_encoder_dict, v_data_encoded_cols = encode_dataframe(v_data)
v_data = make_cols_int(v_data)
'p1noorgorcoauthorperson' is empty and will be dropped.
'p1suspicious' is empty and will be dropped.

Replace NaN rweight values with 0:

In [37]:
v_data["rweight"] = v_data["rweight"].fillna(0)

Create the cosine similarity column.

In [39]:
cosine_similarity_column = []
not_found = []
for index, row in v_data.iterrows():
    p1_id = int(row["p1defaultid"])
    p2_id = int(row["p2defaultid"])
    try:
        cosine_similarity_column.append(calculate_cos_sim(p1_id, p2_id, embedding_data))
    except KeyError:
        # At least one of the two nodes has no embedding.
        cosine_similarity_column.append(numpy.nan)
        not_found.append((p1_id, p2_id))

And we assign it to our dataframe:

In [40]:
v_data["cosine_similarity"] = cosine_similarity_column
In [41]:
total = len(cosine_similarity_column)
nans_count = int(numpy.isnan(cosine_similarity_column).sum())
empty_per = nans_count / total * 100

print(f"For {empty_per:.2f}% of the validation pairs ({nans_count} rows) we found no embedding for one of the nodes; those rows will be dropped.")
For 6.83% of the validation pairs (2009 rows) we found no embedding for one of the nodes; those rows will be dropped.

Where we could not compute a cosine similarity, we drop the row:

In [42]:
v_data = v_data[v_data['cosine_similarity'].notna()]

Check the head:

In [43]:
v_data.head()
Out[43]:
rweight p1averagedocyear p1betweennesscentrality p1pagerank p1COAUTHORcommunitylouvainall p1authoraffiliationname p1authormsaid p1estimatedtotalcitations p1normauthorname p1lastknownaffiliation p1totalcitations p1positioninauthorlist p1idlastknownaffiliation p1authoraffiliationid p1defaultid p1totalpublications p2averagedocyear p2betweennesscentrality p2pagerank p2COAUTHORcommunitylouvainall p2authoraffiliationname p2authormsaid p2estimatedtotalcitations p2normauthorname p2lastknownaffiliation p2totalcitations p2positioninauthorlist p2idlastknownaffiliation p2authoraffiliationid p2defaultid p2totalpublications p2noorgorcoauthorperson p2suspicious adamicAdarscore commonNeighborsscore preferentialAttachmentscore resourceAllocationscore totalNeighborsscore scorecommunitylouvainall topicsimilarity cosine_similarity
0 0.00 2005.97 9160519.16 23.44 12644 0 2234294888 11249 0 0 8424 6 134820265 63634437 2234294888 220 2006.24 766328.40 10.33 4912 1294 1442280916 87700 705 72 62300 4 114983960 0 1442280916 1007 NaN NaN 1.90 10 52026 0.06 447 0 0.04 0.48
1 0.00 2005.97 9160519.16 23.44 12644 0 2234294888 11249 0 0 8424 6 134820265 63634437 2234294888 220 2012.25 3921.93 1.45 4912 1294 2138153055 3987 875 358 2935 1 2801952686 0 2138153055 6 NaN NaN 1.46 8 9867 0.04 309 0 0.02 0.53
2 0.00 2005.97 9160519.16 23.44 12644 0 2234294888 11249 0 0 8424 6 134820265 63634437 2234294888 220 2009.50 167905.46 2.80 4912 236 2470752781 17221 22025 359 12091 3 913958620 913958620 2470752781 120 NaN NaN 1.46 8 16744 0.04 332 0 0.02 0.49
3 1.00 2005.97 9160519.16 23.44 12644 0 2234294888 11249 0 0 8424 6 134820265 63634437 2234294888 220 2007.69 2639394.09 6.35 4912 1294 713325311 58212 19185 359 40729 5 913958620 0 713325311 437 NaN NaN 3.92 19 51129 0.21 435 0 0.02 0.46
4 0.00 2005.97 9160519.16 23.44 12644 0 2234294888 11249 0 0 8424 6 134820265 63634437 2234294888 220 2014.33 26593.81 1.06 18517 236 59592779 3583 19866 781 2710 2 2800006345 913958620 59592779 20 NaN NaN 0.94 5 6578 0.03 301 0 0.01 0.49

Create an X and y set:

In [44]:
X_v_data = v_data[list(X_test)]
y_v_data = v_data.rweight

Make prediction columns:

In [45]:
v_data["rweight_linear_reg"] = lr.predict(X_v_data)
v_data["rweight_elastic_reg"] = elastic.predict(X_v_data)
v_data["rweight_neural_reg"] = mlp.predict(X_v_data)

Decode encoded columns:

In [46]:
v_data = decode_dataframe(v_data, v_data_encoder_dict, v_data_encoded_cols)

Limit cols for visualization:

In [47]:
interesting_cols = ["p1normauthorname", 
                    "p1authoraffiliationname", 
                    "p2normauthorname", 
                    "p2authoraffiliationname",
                    "rweight"]
In [48]:
v_data.to_csv("data/validation.csv")

9.1. Linear Regression

In [49]:
top = 15
col_in_focus = "rweight_linear_reg"
v_data["delta"] = v_data.rweight - v_data[col_in_focus]
cols_in_focus = interesting_cols + [col_in_focus] + ["delta"]

Check the top rows by linear regression prediction:

In [50]:
v_data[cols_in_focus].sort_values(by=col_in_focus, ascending=False).head(top)
Out[50]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_linear_reg delta
2479 luis enjuanes autonomous university of madrid isabel sola spanish national research council 41.00 15.80 25.20
1221 luis enjuanes autonomous university of madrid marta l dediego spanish national research council 18.00 8.01 9.99
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 7.55 -5.55
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 7.55 4.45
435 luis enjuanes autonomous university of madrid stanley perlman university of iowa 8.00 7.05 0.95
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 6.60 19.40
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 6.60 -5.60
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 5.44 14.56
8661 luis enjuanes autonomous university of madrid cristian smerdou spanish national research council 7.00 5.37 1.63
429 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 1.00 5.27 -4.27
430 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 3.00 5.27 -2.27
428 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 4.00 5.27 -1.27
2477 luis enjuanes autonomous university of madrid raul fernandezdelgado spanish national research council 7.00 5.20 1.80
3655 luis enjuanes autonomous university of madrid javier ortego unknown 8.00 5.14 2.86
3653 luis enjuanes autonomous university of madrid david escors unknown 9.00 4.99 4.01

And the top rows by delta:

In [51]:
v_data[cols_in_focus].sort_values(by="delta", ascending=False).head(top)
Out[51]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_linear_reg delta
2479 luis enjuanes autonomous university of madrid isabel sola spanish national research council 41.00 15.80 25.20
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 6.60 19.40
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 5.44 14.56
1221 luis enjuanes autonomous university of madrid marta l dediego spanish national research council 18.00 8.01 9.99
1219 luis enjuanes autonomous university of madrid jose l nietotorres spanish national research council 11.00 4.23 6.77
3658 luis enjuanes autonomous university of madrid sara alonso spanish national research council 10.00 3.57 6.43
6657 luis enjuanes autonomous university of madrid silvia marquezjurado spanish national research council 7.00 2.30 4.70
1297 luis enjuanes autonomous university of madrid carmen galan unknown 8.00 3.43 4.57
1220 luis enjuanes autonomous university of madrid enrique alvarez spanish national research council 9.00 4.45 4.55
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 7.55 4.45
3653 luis enjuanes autonomous university of madrid david escors unknown 9.00 4.99 4.01
5497 luis enjuanes autonomous university of madrid jose m gonzalez spanish national research council 6.00 2.25 3.75
8090 luis enjuanes autonomous university of madrid aitor nogales autonomous university of madrid 7.00 3.26 3.74
9042 luis enjuanes autonomous university of madrid jose m jimenezguardeno autonomous university of madrid 7.00 3.67 3.33
1391 luis enjuanes autonomous university of madrid jose a reglanava spanish national research council 7.00 3.71 3.29

And the bottom rows by delta:

In [52]:
v_data[cols_in_focus].sort_values(by="delta", ascending=True).head(top)
Out[52]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_linear_reg delta
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 6.60 -5.60
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 7.55 -5.55
33 luis enjuanes autonomous university of madrid kwokyung yuen university of hong kong 0.00 4.91 -4.91
429 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 1.00 5.27 -4.27
65 luis enjuanes autonomous university of madrid christian drosten bernhard nocht institute for tropical medicine 1.00 4.82 -3.82
312 luis enjuanes autonomous university of madrid peter j m rottier utrecht university 0.00 3.59 -3.59
34 luis enjuanes autonomous university of madrid kwokhung chan li ka shing faculty of medicine university of ... 0.00 3.40 -3.40
1168 luis enjuanes autonomous university of madrid linda j saif ohio state university 0.00 3.06 -3.06
1035 luis enjuanes autonomous university of madrid georg herrler unknown 1.00 4.04 -3.04
336 luis enjuanes autonomous university of madrid volker thiel university of zurich 0.00 2.95 -2.95
3735 luis enjuanes autonomous university of madrid m l ballesteros spanish national research council 1.00 3.82 -2.82
1615 luis enjuanes autonomous university of madrid paul britton newbury college 0.00 2.81 -2.81
1649 luis enjuanes autonomous university of madrid benjamin w neuman texas a m university texarkana 0.00 2.76 -2.76
288 luis enjuanes autonomous university of madrid marcel a muller university of bonn 0.00 2.74 -2.74
2302 luis enjuanes autonomous university of madrid ding xiang liu nanyang technological university 0.00 2.66 -2.66

9.2. Elastic Net Regression

In [53]:
top = 15
col_in_focus = "rweight_elastic_reg"
v_data["delta"] = v_data.rweight - v_data[col_in_focus]
cols_in_focus = interesting_cols + [col_in_focus] + ["delta"]

Check the top rows by Elastic Net prediction:

In [54]:
v_data[cols_in_focus].sort_values(by=col_in_focus, ascending=False).head(top)
Out[54]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_elastic_reg delta
2479 luis enjuanes autonomous university of madrid isabel sola spanish national research council 41.00 14.86 26.14
1221 luis enjuanes autonomous university of madrid marta l dediego spanish national research council 18.00 7.80 10.20
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 7.46 -5.46
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 7.46 4.54
435 luis enjuanes autonomous university of madrid stanley perlman university of iowa 8.00 6.55 1.45
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 6.44 19.56
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 6.44 -5.44
430 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 3.00 6.07 -3.07
429 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 1.00 6.07 -5.07
428 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 4.00 6.07 -2.07
8661 luis enjuanes autonomous university of madrid cristian smerdou spanish national research council 7.00 5.29 1.71
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 5.25 14.75
2477 luis enjuanes autonomous university of madrid raul fernandezdelgado spanish national research council 7.00 5.18 1.82
3655 luis enjuanes autonomous university of madrid javier ortego unknown 8.00 5.10 2.90
3653 luis enjuanes autonomous university of madrid david escors unknown 9.00 4.96 4.04

And the top rows by difference between prediction and reality:

In [55]:
v_data[cols_in_focus].sort_values(by="delta", ascending=False).head(top)
Out[55]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_elastic_reg delta
2479 luis enjuanes autonomous university of madrid isabel sola spanish national research council 41.00 14.86 26.14
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 6.44 19.56
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 5.25 14.75
1221 luis enjuanes autonomous university of madrid marta l dediego spanish national research council 18.00 7.80 10.20
1219 luis enjuanes autonomous university of madrid jose l nietotorres spanish national research council 11.00 4.19 6.81
3658 luis enjuanes autonomous university of madrid sara alonso spanish national research council 10.00 3.55 6.45
1220 luis enjuanes autonomous university of madrid enrique alvarez spanish national research council 9.00 4.40 4.60
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 7.46 4.54
6657 luis enjuanes autonomous university of madrid silvia marquezjurado spanish national research council 7.00 2.49 4.51
1297 luis enjuanes autonomous university of madrid carmen galan unknown 8.00 3.52 4.48
3653 luis enjuanes autonomous university of madrid david escors unknown 9.00 4.96 4.04
5497 luis enjuanes autonomous university of madrid jose m gonzalez spanish national research council 6.00 2.33 3.67
8090 luis enjuanes autonomous university of madrid aitor nogales autonomous university of madrid 7.00 3.46 3.54
1391 luis enjuanes autonomous university of madrid jose a reglanava spanish national research council 7.00 3.75 3.25
9042 luis enjuanes autonomous university of madrid jose m jimenezguardeno autonomous university of madrid 7.00 3.78 3.22

And the bottom rows by delta:

In [56]:
v_data[cols_in_focus].sort_values(by="delta", ascending=True).head(top)
Out[56]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_elastic_reg delta
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 7.46 -5.46
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 6.44 -5.44
429 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 1.00 6.07 -5.07
33 luis enjuanes autonomous university of madrid kwokyung yuen university of hong kong 0.00 4.32 -4.32
65 luis enjuanes autonomous university of madrid christian drosten bernhard nocht institute for tropical medicine 1.00 4.72 -3.72
430 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 3.00 6.07 -3.07
1168 luis enjuanes autonomous university of madrid linda j saif ohio state university 0.00 2.84 -2.84
3735 luis enjuanes autonomous university of madrid m l ballesteros spanish national research council 1.00 3.82 -2.82
5680 luis enjuanes autonomous university of madrid jose manuel gonzalez spanish national research council 1.00 3.71 -2.71
312 luis enjuanes autonomous university of madrid peter j m rottier utrecht university 0.00 2.61 -2.61
1035 luis enjuanes autonomous university of madrid georg herrler unknown 1.00 3.60 -2.60
336 luis enjuanes autonomous university of madrid volker thiel university of zurich 0.00 2.39 -2.39
34 luis enjuanes autonomous university of madrid kwokhung chan li ka shing faculty of medicine university of ... 0.00 2.36 -2.36
313 luis enjuanes autonomous university of madrid bart l haagmans erasmus university rotterdam 1.00 3.36 -2.36
1649 luis enjuanes autonomous university of madrid benjamin w neuman texas a m university texarkana 0.00 2.30 -2.30

9.3. Neural Network Regression

In [57]:
top = 15
col_in_focus = "rweight_neural_reg"
v_data["delta"] = v_data.rweight - v_data[col_in_focus]
cols_in_focus = interesting_cols + [col_in_focus] + ["delta"]

Check the top rows by neural network prediction:

In [58]:
v_data[cols_in_focus].sort_values(by=col_in_focus, ascending=False).head(top)
Out[58]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_neural_reg delta
2479 luis enjuanes autonomous university of madrid isabel sola spanish national research council 41.00 42.13 -1.13
1221 luis enjuanes autonomous university of madrid marta l dediego spanish national research council 18.00 19.98 -1.98
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 14.80 5.20
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 14.18 -12.18
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 14.18 -2.18
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 13.90 12.10
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 13.90 -12.90
3653 luis enjuanes autonomous university of madrid david escors unknown 9.00 10.14 -1.14
435 luis enjuanes autonomous university of madrid stanley perlman university of iowa 8.00 9.77 -1.77
8661 luis enjuanes autonomous university of madrid cristian smerdou spanish national research council 7.00 9.13 -2.13
1219 luis enjuanes autonomous university of madrid jose l nietotorres spanish national research council 11.00 8.99 2.01
3655 luis enjuanes autonomous university of madrid javier ortego unknown 8.00 8.80 -0.80
2477 luis enjuanes autonomous university of madrid raul fernandezdelgado spanish national research council 7.00 8.65 -1.65
3658 luis enjuanes autonomous university of madrid sara alonso spanish national research council 10.00 7.20 2.80
1391 luis enjuanes autonomous university of madrid jose a reglanava spanish national research council 7.00 6.47 0.53

And the top rows by difference between prediction and reality:

In [59]:
v_data[cols_in_focus].sort_values(by="delta", ascending=False).head(top)
Out[59]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_neural_reg delta
1295 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 26.00 13.90 12.10
6659 luis enjuanes autonomous university of madrid sonia zuniga spanish national research council 20.00 14.80 5.20
6657 luis enjuanes autonomous university of madrid silvia marquezjurado spanish national research council 7.00 2.70 4.30
5497 luis enjuanes autonomous university of madrid jose m gonzalez spanish national research council 6.00 2.31 3.69
9025 luis enjuanes autonomous university of madrid joaquin castilla spanish national research council 6.00 2.53 3.47
8090 luis enjuanes autonomous university of madrid aitor nogales autonomous university of madrid 7.00 3.59 3.41
12591 luis enjuanes autonomous university of madrid carmen capiscol unknown 5.00 1.77 3.23
1220 luis enjuanes autonomous university of madrid enrique alvarez spanish national research council 9.00 5.81 3.19
3658 luis enjuanes autonomous university of madrid sara alonso spanish national research council 10.00 7.20 2.80
1297 luis enjuanes autonomous university of madrid carmen galan unknown 8.00 5.49 2.51
12590 luis enjuanes autonomous university of madrid jose l moreno unknown 4.00 1.52 2.48
4188 luis enjuanes autonomous university of madrid lorena palacio unknown 4.00 1.61 2.39
5502 luis enjuanes autonomous university of madrid enrique calvo unknown 6.00 3.63 2.37
1219 luis enjuanes autonomous university of madrid jose l nietotorres spanish national research council 11.00 8.99 2.01
2278 luis enjuanes autonomous university of madrid kathryn v holmes university of colorado hospital 4.00 2.15 1.85

And the bottom rows by delta:

In [60]:
v_data[cols_in_focus].sort_values(by="delta", ascending=True).head(top)
Out[60]:
p1normauthorname p1authoraffiliationname p2normauthorname p2authoraffiliationname rweight rweight_neural_reg delta
1296 luis enjuanes autonomous university of madrid fernando almazan autonomous university of madrid 1.00 13.90 -12.90
3651 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 2.00 14.18 -12.18
429 luis enjuanes autonomous university of madrid ralph s baric university of north carolina at chapel hill 1.00 4.41 -3.41
65 luis enjuanes autonomous university of madrid christian drosten bernhard nocht institute for tropical medicine 1.00 4.37 -3.37
3735 luis enjuanes autonomous university of madrid m l ballesteros spanish national research council 1.00 3.85 -2.85
312 luis enjuanes autonomous university of madrid peter j m rottier utrecht university 0.00 2.71 -2.71
1035 luis enjuanes autonomous university of madrid georg herrler unknown 1.00 3.69 -2.69
5680 luis enjuanes autonomous university of madrid jose manuel gonzalez spanish national research council 1.00 3.67 -2.67
2302 luis enjuanes autonomous university of madrid ding xiang liu nanyang technological university 0.00 2.43 -2.43
15536 luis enjuanes autonomous university of madrid ines m anton spanish national research council 4.00 6.22 -2.22
3650 luis enjuanes autonomous university of madrid carlos sanchez autonomous university of madrid 12.00 14.18 -2.18
8661 luis enjuanes autonomous university of madrid cristian smerdou spanish national research council 7.00 9.13 -2.13
33 luis enjuanes autonomous university of madrid kwokyung yuen university of hong kong 0.00 2.05 -2.05
336 luis enjuanes autonomous university of madrid volker thiel university of zurich 0.00 2.02 -2.02
1281 luis enjuanes autonomous university of madrid marian c horzinek utrecht university 1.00 3.00 -2.00