I have built a fully-connected neural network in both scikit-learn (v 0.20.0) and Keras (v 2.2.4) with TensorFlow backend (v 1.12.0). There are 10 units in the single hidden layer. In both cases I choose the training and test data via a call to scikit-learn’s train_test_split function with random_state set to 0. They are then both scaled using scikit-learn’s StandardScaler. In fact, up to this point the code for each case is literally identical.
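For concreteness, here is a minimal sketch of that shared code, assuming the feature matrix X and target y are already loaded (I fit the scaler on the training set and apply it to both splits):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)  # learn the scaling from the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)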
In scikit-learn I define the neural network with MLPRegressor. The output of that function call is
MLPRegressor(activation='logistic', alpha=1.0, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(10,), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='sgd', tol=0.0001,
             validation_fraction=0.2, verbose=False, warm_start=False)

Most of those parameters aren't used here; the relevant ones are 200 iterations (max_iter=200), no early stopping, a constant learning rate, the SGD solver, nesterovs_momentum=True, and momentum=0.9.
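For reference, a call that reproduces that configuration, spelling out only the arguments that differ from the defaults or matter here (alpha is also set to 0.0 for the unregularized runs shown below), might look like this:

from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                   solver='sgd', alpha=1.0, max_iter=200,
                   validation_fraction=0.2)
mlp.fit(X_train, y_train)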
The definition in Keras is (call this Keras 1)
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from keras import backend as K  # used by the custom regularizer skl_norm below

mlp = Sequential()  # create a sequential neural network using Keras
mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear'))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)

My understanding of Keras is that this should be the same network as the scikit-learn one, with one possible exception: scikit-learn should be regularizing all the weights between layers, while this Keras network only regularizes the weights going into the hidden layer from the input layer. I can add regularization of the weights from the hidden layer to the output layer in the following way (call this Keras 2):
mlp = Sequential()  # create a sequential neural network using Keras
mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear', kernel_regularizer=skl_norm))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)

To make sure the regularization in Keras matches that in scikit-learn, I have implemented a custom regularization function in Keras:
def skl_norm(weight_matrix):
    alpha = 1.0  # to match the alpha parameter I used in scikit-learn
    return alpha * 0.5 * K.sum(K.square(weight_matrix))

Here the alpha parameter should be the same as the one that appears in scikit-learn. The code following these definitions differs only in the names of the methods used by each API.
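As a sanity check that skl_norm computes the intended penalty, it can be evaluated on a fixed matrix and compared against a NumPy reference; a minimal sketch:

import numpy as np
from keras import backend as K

W = np.random.randn(5, 3).astype('float32')

expected = 1.0 * 0.5 * np.sum(W ** 2)     # alpha * 0.5 * sum of squared weights
actual = K.eval(skl_norm(K.constant(W)))  # evaluate the Keras regularizer eagerly

print(expected, actual)  # the two values should agree to float32 precision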
My results suggest that regularization is not the same in the two APIs or, more likely, my implementation in Keras is not what I think it is. Here is a comparison between the outputs of the neural networks:

Top row is alpha = 0, bottom row is alpha = 1.0. Left column is scikit-learn, middle column is Keras 1, right column is Keras 2. Rather than discuss all the differences between the plots, what jumps out at me immediately is that when regularization is "turned off" (alpha = 0) the fits are very similar. When regularization is "turned on" (alpha = 1) scikit-learn outperforms Keras, especially Keras 2, where the weights from the hidden layer to the output are also regularized.
On different runs the R^2 values vary a little, but the variations are not large enough to account for the differences in the bottom row. So, what is the difference between these two network implementations?
Update:
I have since found that if I use an "unbounded" activation function in Keras, training fails entirely, returning nan for all predictions, whereas it is fine in scikit-learn. By "unbounded" I mean an activation whose output is not bounded, for example linear/identity, softplus or relu.
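For concreteness, the failing variant differs from Keras 1 only in the hidden-layer activation; a sketch of the relu case (imports as in the Keras 1 snippet):

import numpy as np

mlp = Sequential()
mlp.add(Dense(units=10, activation='relu', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear'))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)
print(np.isnan(mlp.predict(X_test)).all())  # True in my runs: every prediction is nan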
When I turn on the TensorBoard callback I get an error that ends with the following (edited to leave out irrelevant and potentially sensitive information):

InvalidArgumentError (see above for traceback): Nan in summary histogram for: dense_2/bias_0
[[node dense_2/bias_0 (defined at /Users/…/python2.7/site-packages/keras/callbacks.py:796) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dense_2/bias_0/tag, dense_2/bias/read)]]

Based on this error, I guess that the bias units for the second layer are getting really large, but I don’t know why this would happen in Keras/TF but not scikit-learn.
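One direct way to test that guess is to read the second layer's parameters back after training; a sketch, assuming mlp is the trained Keras model from above:

import numpy as np

W_out, b_out = mlp.layers[1].get_weights()  # a Dense layer returns [kernel, bias]
print(b_out)                  # nan or very large if training diverged
print(np.isnan(W_out).any())  # check whether the kernel diverged as well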
Since softplus does not have the property that f(x) = 0 when x = 0, I don't think the problem is that the inputs are nearly zero. Furthermore, a tanh activation works really well, so I don't think I'm having an issue with inputs clustering near zero. Both sigmoid/logistic and softplus have the property that f(x) -> 0 as x -> -infinity, and sigmoid/logistic works well while softplus fails, so I don't think I'm having an issue with inputs going to -infinity either.
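For reference, a quick numerical check of the two activation properties used in that argument:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

print(sigmoid(0.0), softplus(0.0))      # 0.5 and ~0.693: neither is zero at x = 0
print(sigmoid(-30.0), softplus(-30.0))  # both effectively zero as x -> -infinity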
