In order for K.gradients() to work like that, you have to enclose it in a Lambda() layer, because otherwise a full Keras layer is not created, and you can't chain it or train through it. So this code works (tested):
import keras
from keras.models import *
from keras.layers import *
from keras import backend as K
import tensorflow as tf
# wrap K.gradients() in a Lambda() so the result is a real Keras layer
def grad( y, x ):
    return Lambda( lambda z: K.gradients( z[ 0 ], z[ 1 ] ), output_shape = [1] )( [ y, x ] )
# models f( x ) = log( x + d )
def network( i, d ):
    m = Add()( [ i, d ] )
    a = Lambda(lambda x: K.log( x ) )( m )
    return a
fixed_input = Input(tensor=tf.constant( [ 1.0 ] ) )
double = Input(tensor=tf.constant( [ 2.0 ] ) )
a = network( fixed_input, double )
b = grad( a, fixed_input )
c = grad( b, fixed_input )
d = grad( c, fixed_input )
e = grad( d, fixed_input )
model = Model( inputs = [ fixed_input, double ], outputs = [ a, b, c, d, e ] )
print( model.predict( x=None, steps = 1 ) )
network() models f( x ) = log( x + 2 ) at x = 1; grad() is where the gradient calculation is done. This code outputs:
[array([1.0986123], dtype=float32), array([0.33333334], dtype=float32), array([-0.11111112], dtype=float32), array([0.07407408], dtype=float32), array([-0.07407409], dtype=float32)]
which are the correct values for log( 3 ), ⅓, -1 / 3², 2 / 3³, -6 / 3⁴.
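(Sanity check: for f( x ) = log( x + 2 ) the successive derivatives are f'( x ) = 1 / ( x + 2 ), f''( x ) = -1 / ( x + 2 )², f'''( x ) = 2 / ( x + 2 )³ and f''''( x ) = -6 / ( x + 2 )⁴, which at x = 1 evaluate to 0.3333..., -0.1111..., 0.0740... and -0.0740..., matching the printed arrays.)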
Reference TensorFlow code
For reference, the same code in plain TensorFlow (used for testing):
import tensorflow as tf
a = tf.constant( 1.0 )
a2 = tf.constant( 2.0 )
b = tf.log( a + a2 )
c = tf.gradients( b, a )
d = tf.gradients( c, a )
e = tf.gradients( d, a )
f = tf.gradients( e, a )
with tf.Session() as sess:
    print( sess.run( [ b, c, d, e, f ] ) )
outputs the same values:
[1.0986123, [0.33333334], [-0.11111112], [0.07407408], [-0.07407409]]
Hessians
tf.hessians() does return the second derivative; it's a shorthand for chaining two tf.gradients() calls. The Keras backend has no hessians() equivalent though, so there you do have to chain two K.gradients() calls.
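For illustration, a minimal sketch of that equivalence in plain TF 1.x (same f( x ) = log( x + 2 ) as in the reference code above; the variable names are just illustrative):
import tensorflow as tf
x = tf.constant( [ 1.0 ] )
y = tf.log( x + tf.constant( [ 2.0 ] ) )
hess = tf.hessians( y, x )                          # shorthand for the chained tf.gradients() below
chained = tf.gradients( tf.gradients( y, x ), x )   # explicit chaining, same value
with tf.Session() as sess:
    print( sess.run( [ hess, chained ] ) )          # both evaluate to -1/9 ≈ -0.1111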
Numerical approximation
If for some reason none of the above works, you might want to consider numerically approximating the second derivative by taking differences over a small ε distance. This basically triples the network for each input, so besides lacking accuracy, this solution raises serious efficiency concerns. The idea is the standard finite-difference estimate f''( x ) ≈ ( ( f( x + ε ) - f( x ) ) - ( f( x ) - f( x - ε ) ) ) / ε², which is exactly what the layers below compute. Anyway, the code (tested):
import keras
from keras.models import *
from keras.layers import *
from keras import backend as K
import tensorflow as tf
def network( i, d ):
    m = Add()( [ i, d ] )
    a = Lambda(lambda x: K.log( x ) )( m )
    return a
fixed_input = Input(tensor=tf.constant( [ 1.0 ], dtype = tf.float64 ) )
double = Input(tensor=tf.constant( [ 2.0 ], dtype = tf.float64 ) )
epsilon = Input( tensor = tf.constant( [ 1e-7 ], dtype = tf.float64 ) )
eps_reciproc = Input( tensor = tf.constant( [ 1e+7 ], dtype = tf.float64 ) )
a0 = network( Subtract()( [ fixed_input, epsilon ] ), double )
a1 = network( fixed_input, double )
a2 = network( Add()( [ fixed_input, epsilon ] ), double )
d0 = Subtract()( [ a1, a0 ] )
d1 = Subtract()( [ a2, a1 ] )
dv0 = Multiply()( [ d0, eps_reciproc ] )
dv1 = Multiply()( [ d1, eps_reciproc ] )
dd0 = Multiply()( [ Subtract()( [ dv1, dv0 ] ), eps_reciproc ] )
model = Model( inputs = [ fixed_input, double, epsilon, eps_reciproc ], outputs = [ a0, dv0, dd0 ] )
print( model.predict( x=None, steps = 1 ) )
Outputs:
[array([1.09861226]), array([0.33333334]), array([-0.1110223])]
(This only gets to the second derivative.)