I have built a fully-connected neural network in both scikit-learn (v 0.20.0) and Keras (v 2.2.4) with TensorFlow backend (v 1.12.0). There are 10 units in the single hidden layer. In both cases I choose the training and test data via a call to scikit-learn’s train_test_split function with random_state set to 0. They are then both scaled using scikit-learn’s StandardScaler. In fact, up to this point the code for each case is literally identical.
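Roughly, the shared portion looks like this (a sketch; the variable names `X` and `y` are placeholders for my actual data, everything else is as described above):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Shared preprocessing used for both the scikit-learn and the Keras models.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit only on the training data
X_test = scaler.transform(X_test)        # reuse the training statistics
```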
In scikit-learn I define the neural network with MLPRegressor. The output of that function call is
```python
MLPRegressor(activation='logistic', alpha=1.0, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(10,), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='sgd', tol=0.0001,
             validation_fraction=0.2, verbose=False, warm_start=False)
```
Most of those parameters aren't used, but the relevant ones are: 200 iterations, no early stopping, a constant learning rate, the SGD solver, nesterovs_momentum=True, and momentum=0.9.
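For reference, a call along these lines reproduces that configuration (a sketch; the name `mlp_skl` is a placeholder, and anything not shown in the output above stays at its default):

```python
from sklearn.neural_network import MLPRegressor

# Sketch of the estimator; parameters not listed here fall back to their defaults.
mlp_skl = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                       solver='sgd', alpha=1.0, max_iter=200,
                       validation_fraction=0.2)
```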
The definition in Keras is (call this Keras 1)
```python
mlp = Sequential()  # create a sequential neural network using Keras
mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear'))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)
```
My understanding of Keras is that this should be the same network as the scikit-learn one, with one possible exception: scikit-learn should be regularizing all of the weights between layers, while this Keras network only regularizes the weights going into the hidden layer from the input layer. I can add regularization of the weights from the hidden layer to the output layer in the following way (call this Keras 2):
```python
mlp = Sequential()  # create a sequential neural network using Keras
mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear', kernel_regularizer=skl_norm))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)
```
To make sure the regularization in Keras matches that in scikit-learn, I have implemented a custom regularization function in Keras:
```python
def skl_norm(weight_matrix):
    alpha = 1.0  # to match the parameter I used in scikit-learn
    return alpha * 0.5 * K.sum(K.square(weight_matrix))
```
Here the alpha parameter is the same as the one that appears in scikit-learn. The code following these definitions differs only in the names of the methods used by each API.
My results suggest that regularization is not the same in the two APIs or, more likely, my implementation in Keras is not what I think it is. Here is a comparison between the outputs of the neural networks:
Top row is alpha = 0, bottom row is alpha = 1.0. Left column is scikit-learn, middle column is Keras 1, right column is Keras 2. Rather than discuss all the differences between the plots, what jumps out at me immediately is that when regularization is "turned off" (alpha=0) the fits are very similar. When regularization is "turned on" (alpha=1) scikit-learn outperforms Keras, especially Keras 2, where the weights from the hidden layer to the output are also regularized.
On different runs the R^2 values vary a little, but not by enough to account for the differences in the bottom row. So, what is the difference between these two network implementations?
Update:
I have since found that if I use an "unbounded" activation function in Keras, training fails entirely, returning nan for all predictions, whereas it is fine in scikit-learn. By "unbounded" I mean an activation that allows output values up to infinity, for example linear/identity, softplus, or relu.
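For example, the only change needed to trigger the failure is swapping the hidden-layer activation in the Keras 1 definition:

```python
# Identical to Keras 1 except for the hidden-layer activation; with 'relu'
# (or 'linear'/'softplus') every prediction comes back as nan.
mlp = Sequential()
mlp.add(Dense(units=10, activation='relu', input_dim=X.shape[1],
              kernel_regularizer=skl_norm))
mlp.add(Dense(units=1, activation='linear'))
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
mlp.compile(optimizer=opt, loss='mean_squared_error')
mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)
```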
When I turn on the TensorBoard callback I get an error that ends with (edited to leave out irrelevant potentially sensitive information):
```
InvalidArgumentError (see above for traceback): Nan in summary histogram for: dense_2/bias_0
     [[node dense_2/bias_0 (defined at /Users/…/python2.7/site-packages/keras/callbacks.py:796) = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dense_2/bias_0/tag, dense_2/bias/read)]]
```
Based on this error, I guess that the bias units for the second layer are getting really large, but I don’t know why this would happen in Keras/TF but not scikit-learn.
Since softplus doesn't have the property that f(x) = 0 when x = 0, I don't think the problem is inputs that are nearly zero. Furthermore, a tanh activation works really well, so I don't think I'm having an issue with inputs clustering near zero. Both sigmoid/logistic and softplus have the property that f(x) → 0 as x → -infinity, and sigmoid/logistic works well while softplus fails, so I don't think I'm having an issue with inputs going to -infinity either.
Okay, this is a complex issue involving subtle differences between the scikit-learn and Keras/TensorFlow implementations. Let's break down the possible causes and how to investigate them. Given the details and the systematic approach you've taken, here's a comprehensive analysis and debugging strategy.

**Root Causes and Hypotheses:**

1. **Regularization Discrepancy:** Despite your efforts, the regularization *might still* be different between the two implementations. This is the most likely culprit given the symptom of large differences when regularization is active.

2. **Initialization Differences:** Keras and scikit-learn might use different default weight initialization schemes. Even small differences in initial weights can lead to divergence, especially with different activation functions and regularization.

3. **Numerical Instability with Unbounded Activation Functions:** Your observation about `NaN` values with unbounded activations in Keras points to a serious numerical instability issue. This is likely exacerbated by regularization (or lack thereof) and potentially the default initialization. The exploding bias term strongly suggests this instability.

4. **Gradient Clipping:** Scikit-learn might be internally using gradient clipping to prevent exploding gradients, which Keras, by default, isn't. This is especially relevant with unbounded activations.

5. **Batch Size Interaction with Regularization and Unbounded Activations:** The batch size of 200 might be interacting negatively with the regularization strength and unbounded activations in Keras, leading to large weight updates and instability.

6. **Scaling Issues:** Although you mention that the inputs are scaled using scikit-learn's `StandardScaler`, it is worth double-checking how this interacts with the unbounded activation functions.

**Debugging and Solution Strategy:**

1. **Double-Check Regularization Implementation:** This is critical. Let's revisit your Keras regularization function:

    ```python
    import keras.backend as K
    from keras.regularizers import Regularizer

    class SKLNorm(Regularizer):
        """L2 regularizer intended to match scikit-learn's penalty."""

        def __init__(self, alpha=0.0):
            self.alpha = K.cast_to_floatx(alpha)  # ensure the correct dtype

        def __call__(self, x):
            return self.alpha * 0.5 * K.sum(K.square(x))

        def get_config(self):
            return {'alpha': float(self.alpha)}  # needed for serialization

    # Usage in the Keras model:
    skl_norm = SKLNorm(alpha=1.0)

    mlp = Sequential()
    mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1], kernel_regularizer=skl_norm))
    mlp.add(Dense(units=1, activation='linear', kernel_regularizer=skl_norm))  # regularize the output layer as well
    opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
    mlp.compile(optimizer=opt, loss='mean_squared_error')
    mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0)
    ```

    Key changes and reasons:

    * **Using the `Regularizer` Class:** The preferred way to define custom regularizers in Keras is by subclassing `keras.regularizers.Regularizer`. This ensures proper integration with the Keras framework. It requires implementing `__call__` (the regularization calculation) and `get_config` (for saving/loading models).
    * **Explicitly Casting to Float:** Use `K.cast_to_floatx(alpha)` to ensure the `alpha` value has the correct data type (float32 or float64) for your TensorFlow backend. This can prevent subtle type-related issues.
    * **Serialization:** The `get_config` method is crucial if you plan to save and load your model. It tells Keras how to serialize the custom regularizer; make sure the stored value is a plain float.
    * **Instantiation and Usage:** Instantiate the `SKLNorm` regularizer with the desired `alpha` value *before* creating the layers, and pass the instance to the `kernel_regularizer` argument.
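    As a quick sanity check (a sketch that assumes the `SKLNorm` class above), you can evaluate the regularizer on a known weight matrix and compare it with the same quantity computed directly in NumPy:

    ```python
    import numpy as np
    import keras.backend as K

    # The penalty for a known matrix should equal alpha * 0.5 * sum(w**2).
    w = np.arange(6, dtype='float32').reshape(2, 3)
    penalty = K.eval(SKLNorm(alpha=1.0)(K.constant(w)))
    print(penalty, 0.5 * np.sum(w ** 2))  # both should be 27.5
    ```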
2. **Weight Initialization:** Experiment with Keras weight initializers to match scikit-learn's (if you can determine it). Common options are `'glorot_uniform'` (also called Xavier initialization) or `'he_normal'`.

    ```python
    from keras.initializers import glorot_uniform

    initializer = glorot_uniform(seed=None)  # or he_normal()

    mlp = Sequential()
    mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1],
                  kernel_regularizer=skl_norm, kernel_initializer=initializer))
    mlp.add(Dense(units=1, activation='linear',
                  kernel_regularizer=skl_norm, kernel_initializer=initializer))
    # ... rest of the model definition
    ```

    Try different initializers and see whether the instability with unbounded activations is reduced. If you can determine scikit-learn's default initializer, use that in Keras.

3. **Gradient Clipping:** Implement gradient clipping in Keras. This is good practice, especially with unbounded activations, and can often resolve `NaN` issues.

    ```python
    opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True,
                         clipnorm=1.0)  # clip gradients to a maximum norm of 1.0
    mlp.compile(optimizer=opt, loss='mean_squared_error')
    ```

    Experiment with different `clipnorm` values (e.g., 0.5, 1.0, 5.0).

4. **Reduce Batch Size:** Try smaller batch sizes (e.g., 32, 64). Smaller batches can sometimes stabilize training.

5. **Learning Rate Decay:** Implement a learning rate decay schedule in Keras. This can help the network converge more smoothly, especially in later epochs.

    ```python
    from keras.callbacks import LearningRateScheduler
    import math

    def step_decay(epoch):
        initial_lrate = 0.001
        drop = 0.5          # halve the learning rate
        epochs_drop = 10.0  # every 10 epochs
        return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

    lrate = LearningRateScheduler(step_decay)
    callbacks_list = [lrate]

    mlp = Sequential()
    mlp.add(Dense(units=10, activation='sigmoid', input_dim=X.shape[1], kernel_regularizer=skl_norm))
    mlp.add(Dense(units=1, activation='linear', kernel_regularizer=skl_norm))
    opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True)
    mlp.compile(optimizer=opt, loss='mean_squared_error')
    mlp.fit(X_train, y_train, batch_size=200, epochs=200, verbose=0, callbacks=callbacks_list)
    ```

    This example implements a step decay, but other decay schedules are available (e.g., exponential decay).

6. **Examine Scaled Data:** Print out some values from `X_train` and `y_train` to make sure they are within reasonable bounds, especially given that you are using an unbounded activation function.
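    A minimal sketch of such a check (it assumes only the `X_train` and `y_train` arrays from your preprocessing):

    ```python
    # After StandardScaler, each training feature should have roughly
    # zero mean and unit variance; also look at the range of the targets.
    print(X_train.mean(axis=0), X_train.std(axis=0))
    print(X_train.min(), X_train.max())
    print(y_train.min(), y_train.max())
    ```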
7. **TensorBoard for Deeper Insight:** Use TensorBoard extensively. Monitor the following:

    * **Weight Histograms:** Check the distributions of weights in each layer. Look for unusually large values or rapid changes.
    * **Gradient Histograms:** Monitor the gradients during training. Exploding gradients are a clear sign of instability.
    * **Loss Curve:** Ensure the loss is decreasing smoothly. Spikes or oscillations can indicate problems.
    * **Activation Histograms:** Monitor the activation values to see whether they are becoming very large or concentrated at specific values.

8. **Simplified Model for Debugging:** Start with a *very* simple model (e.g., no hidden layer, or a single neuron in the hidden layer). Get that working reliably in Keras, then gradually increase the complexity. This helps isolate the source of the instability.

9. **Check Activation Function Implementation:** Although unlikely to be the cause, verify that the sigmoid (logistic) functions in scikit-learn and Keras are exactly the same. Different libraries sometimes have small variations in their implementations.
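    A quick way to compare them (a sketch; as far as I know, scikit-learn's `'logistic'` activation is SciPy's `expit`):

    ```python
    import numpy as np
    from scipy.special import expit   # logistic function, as used by scikit-learn
    import keras.backend as K

    x = np.linspace(-10, 10, 5).astype('float32')
    print(expit(x))                           # SciPy / scikit-learn
    print(K.eval(K.sigmoid(K.constant(x))))   # Keras / TensorFlow
    ```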
**Code Example incorporating recommendations:**

```python
import math

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from keras.callbacks import TensorBoard, LearningRateScheduler
from keras.initializers import glorot_uniform
from keras.regularizers import Regularizer
import keras.backend as K
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Custom regularizer
class SKLNorm(Regularizer):
    """L2 regularizer intended to match scikit-learn's penalty."""

    def __init__(self, alpha=0.0):
        self.alpha = K.cast_to_floatx(alpha)  # ensure the correct dtype

    def __call__(self, x):
        return self.alpha * 0.5 * K.sum(K.square(x))

    def get_config(self):
        return {'alpha': float(self.alpha)}  # needed for serialization

# Learning rate schedule
def step_decay(epoch):
    initial_lrate = 0.001
    drop = 0.5          # halve the learning rate
    epochs_drop = 10.0  # every 10 epochs
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

# Generate some dummy data
X = np.random.rand(100, 5)
y = np.random.rand(100)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Parameters
alpha = 1.0      # regularization strength
batch_size = 32  # reduced batch size
epochs = 200

# Define regularizer and initializer
skl_norm = SKLNorm(alpha=alpha)
initializer = glorot_uniform(seed=None)

# Define model
mlp = Sequential()
mlp.add(Dense(units=10, activation='sigmoid', input_dim=X_train.shape[1],
              kernel_regularizer=skl_norm, kernel_initializer=initializer))
mlp.add(Dense(units=1, activation='linear', kernel_regularizer=skl_norm,
              kernel_initializer=initializer))

# Optimizer with gradient clipping
opt = optimizers.SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=True, clipnorm=1.0)

# Compile model
mlp.compile(optimizer=opt, loss='mean_squared_error')

# Callbacks
lrate = LearningRateScheduler(step_decay)
tensorboard_callback = TensorBoard(log_dir='./logs', histogram_freq=1, write_graph=True,
                                   write_grads=True, write_images=True)  # enable histogram monitoring
callbacks_list = [lrate, tensorboard_callback]

# Train model
mlp.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1,
        validation_data=(X_test, y_test), callbacks=callbacks_list)

# Evaluate model
loss = mlp.evaluate(X_test, y_test, verbose=0)
print("Test Loss: {}".format(loss))
```

Key points about the code example:

* **Complete and runnable:** The example includes dummy data generation, splitting, and scaling, so you can run it directly. It also includes all the recommended changes (custom regularizer, initialization, gradient clipping, learning rate decay, TensorBoard).
* **TensorBoard Enabled:** The `TensorBoard` callback is configured to record histograms of weights and gradients, which is crucial for debugging the instability. Run `tensorboard --logdir ./logs` in your terminal to view the logs.
* **Reduced Batch Size:** The batch size is set to 32.
* **Verbose Training:** `verbose=1` is used during training to show progress.
* **Validation Data:** `validation_data` is passed to `fit` to monitor performance on the test set during training.

**Troubleshooting Steps:**

1. **Run the provided code example.** Does it run without `NaN` errors? If not, there's a problem with your TensorFlow/Keras installation or environment.

2. **Examine TensorBoard:** After running the example for a few epochs, open TensorBoard and carefully examine the weight and gradient histograms. Look for:

    * weights or gradients that are becoming very large (positive or negative);
    * gradients that are consistently zero;
    * weights or gradients that are oscillating wildly.

3. **Experiment:** Systematically change one thing at a time (e.g., different initializers, clipping values, learning rate decay parameters) and observe the effect on the training process in TensorBoard.

4. **Simplify:** If the problem persists, go back to an even simpler model (e.g., no hidden layer) and get that working reliably.

By following this methodical approach, you should be able to pinpoint the cause of the discrepancies between your scikit-learn and Keras implementations and resolve the numerical instability issues. The most likely culprits are still related to regularization, initialization, or gradient handling. Good luck!