When decomposing the 4-qubit Toffoli gate (three controls, one target) over the Clifford+T universal gate set with one ancilla qubit, what is the lowest T-count one can achieve? I can only find papers that treat the general n-qubit Toffoli gate, and I am not sure whether a better implementation exists specifically for small n. For example, the paper below gives a T-count of 32n - 96 (for n = 4, this is 32), which seems quite poor given that the T-count of the 3-qubit Toffoli (without ancilla) is just 7.
Paper: "Decompositions of n-qubit Toffoli Gates with Linear Circuit Complexity", https://link.springer.com/article/10.1007/s10773-017-3389-4
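For reference, the 7-T figure for the 3-qubit Toffoli can be checked with a small pure-Python state-vector simulation. This is only a sketch: it assumes the standard textbook Clifford+T circuit for the Toffoli (not a circuit taken from the paper above), and the helper names (`run`, `is_toffoli`, `t_count`) are my own, made up for illustration.

```python
import cmath
import math

N = 3  # qubits: a = 0, b = 1 (controls), c = 2 (target); qubit 0 is the most significant bit

def bit(i, q):
    """Value of qubit q in basis index i."""
    return (i >> (N - 1 - q)) & 1

def flip(i, q):
    """Basis index i with qubit q flipped."""
    return i ^ (1 << (N - 1 - q))

def apply(gate, state):
    """Apply one gate, given as a tuple such as ("h", 2) or ("cx", 0, 1)."""
    name = gate[0]
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        if name == "h":
            q, s = gate[1], 1 / math.sqrt(2)
            j = flip(i, q)
            if bit(i, q) == 0:      # |0> -> (|0> + |1>) / sqrt(2)
                out[i] += s * amp
                out[j] += s * amp
            else:                   # |1> -> (|0> - |1>) / sqrt(2)
                out[j] += s * amp
                out[i] -= s * amp
        elif name in ("t", "tdg"):
            angle = math.pi / 4 if name == "t" else -math.pi / 4
            out[i] += amp * (cmath.exp(1j * angle) if bit(i, gate[1]) else 1)
        elif name == "cx":
            ctrl, tgt = gate[1], gate[2]
            out[flip(i, tgt) if bit(i, ctrl) else i] += amp
        else:
            raise ValueError(f"unknown gate {name}")
    return out

# Standard 7-T Toffoli decomposition (assumed here, gates listed in time order).
TOFFOLI_CIRCUIT = [
    ("h", 2), ("cx", 1, 2), ("tdg", 2), ("cx", 0, 2), ("t", 2),
    ("cx", 1, 2), ("tdg", 2), ("cx", 0, 2), ("t", 1), ("t", 2),
    ("h", 2), ("cx", 0, 1), ("t", 0), ("tdg", 1), ("cx", 0, 1),
]

def run(circuit, basis_index):
    """Run the circuit on a computational basis state and return the final state."""
    state = [0j] * (2 ** N)
    state[basis_index] = 1 + 0j
    for gate in circuit:
        state = apply(gate, state)
    return state

def is_toffoli(circuit, tol=1e-9):
    """True if the circuit maps every basis state exactly like a Toffoli gate."""
    for x in range(2 ** N):
        expected = flip(x, 2) if bit(x, 0) and bit(x, 1) else x
        if abs(run(circuit, x)[expected] - 1) > tol:
            return False
    return True

def t_count(circuit):
    """Number of T and T-dagger gates in the circuit."""
    return sum(1 for gate in circuit if gate[0] in ("t", "tdg"))

if __name__ == "__main__":
    print("implements Toffoli:", is_toffoli(TOFFOLI_CIRCUIT))
    print("T-count:", t_count(TOFFOLI_CIRCUIT))
```

The same checker could in principle be extended to N = 5 (4-qubit Toffoli plus one ancilla) to validate any candidate circuit and count its T gates, though it says nothing about whether a given T-count is optimal.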

