Yes, you should definitely use FSK, but it has to be asynchronous (noncoherent), and here is why. Demodulating FSK synchronously is called coherent demodulation. It requires locking onto the phase of the incoming carrier (phase synchronization), which is normally done with a phase-locked loop (PLL), and PLLs don't work well unless you have a signal-to-noise ratio (SNR) of at least 10 dB or so. Audio broadcast normally doesn't have such a high SNR, so forget about coherent demodulation. Besides, in practice with electromagnetic signals noncoherent FSK requires at most about 1 dB more Eb/N0 than coherent FSK for Pb ≤ 10^-4 (that is, to obtain the same bit error probability Pb you only need to spend about one extra dB of energy on each bit). Yet the noncoherent FSK demodulator is considerably easier to build, since coherent reference signals need not be generated. Therefore almost all practical FSK receivers use noncoherent demodulation: everyone prefers transmitting an extra dB of power to dealing with all those synchronization problems.
Answering your main considerations:
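To put a number on that "about 1 dB" penalty, here is a minimal sketch using the textbook bit error expressions for binary orthogonal FSK in additive white Gaussian noise: Pb = Q(sqrt(Eb/N0)) for coherent detection and Pb = 0.5·exp(-Eb/(2·N0)) for noncoherent detection. The binary case and the AWGN channel are my assumptions for illustration; the function names are made up for this example.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pb_coherent_bfsk(ebn0):
    """Bit error probability of coherent binary orthogonal FSK (linear Eb/N0)."""
    return q_function(math.sqrt(ebn0))

def pb_noncoherent_bfsk(ebn0):
    """Bit error probability of noncoherent binary orthogonal FSK (linear Eb/N0)."""
    return 0.5 * math.exp(-ebn0 / 2.0)

def required_ebn0_db(pb_func, target_pb, lo_db=0.0, hi_db=30.0):
    """Bisect for the Eb/N0 (in dB) at which pb_func reaches target_pb."""
    for _ in range(100):
        mid_db = 0.5 * (lo_db + hi_db)
        if pb_func(10.0 ** (mid_db / 10.0)) > target_pb:
            lo_db = mid_db   # still too many errors: need more Eb/N0
        else:
            hi_db = mid_db
    return 0.5 * (lo_db + hi_db)

coherent_db = required_ebn0_db(pb_coherent_bfsk, 1e-4)
noncoherent_db = required_ebn0_db(pb_noncoherent_bfsk, 1e-4)
penalty_db = noncoherent_db - coherent_db
print(f"coherent: {coherent_db:.2f} dB, noncoherent: {noncoherent_db:.2f} dB, "
      f"penalty: {penalty_db:.2f} dB")
```

At Pb = 10^-4 the penalty comes out a bit under 1 dB, and it shrinks further as the target Pb gets smaller.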
200bps bandwidth if not more:
I've achieved 200 bps using continuous-phase 8-FSK with orthogonal carriers, with the smartphone placed 1 m away from the speaker.
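As an illustration of what such a modulator can look like, here is a rough sketch of a continuous-phase 8-FSK transmitter. None of the numbers below are the ones I actually used; they are assumptions chosen to make the arithmetic clean: 48 kHz sampling, 3 bits per symbol at 200 bps (so 720 samples per symbol), a lowest tone at 18 kHz, and a tone spacing of 200 Hz, which is an integer multiple of the symbol rate so the tones stay orthogonal for a noncoherent receiver.

```python
import math

FS = 48000                  # sampling rate in Hz (assumed)
BIT_RATE = 200              # target bit rate in bps
BITS_PER_SYMBOL = 3         # 8-FSK carries 3 bits per symbol
SAMPLES_PER_SYMBOL = FS * BITS_PER_SYMBOL // BIT_RATE   # 720 samples
F0 = 18000.0                # lowest tone in Hz (assumed)
SPACING = 3 * FS / SAMPLES_PER_SYMBOL   # 200 Hz = 3 x symbol rate (assumed)
TONES = [F0 + k * SPACING for k in range(8)]

def bits_to_symbols(bits):
    """Group bits (MSB first) into 3-bit symbol indices 0..7."""
    symbols = []
    for i in range(0, len(bits) - BITS_PER_SYMBOL + 1, BITS_PER_SYMBOL):
        value = 0
        for b in bits[i:i + BITS_PER_SYMBOL]:
            value = (value << 1) | b
        symbols.append(value)
    return symbols

def modulate(symbols):
    """Generate samples with a single running phase, so the phase never jumps."""
    phase = 0.0
    samples = []
    for s in symbols:
        step = 2.0 * math.pi * TONES[s] / FS   # phase advance per sample
        for _ in range(SAMPLES_PER_SYMBOL):
            samples.append(math.sin(phase))
            phase = (phase + step) % (2.0 * math.pi)
    return samples

symbols = bits_to_symbols([1, 0, 1, 0, 1, 1])
audio = modulate(symbols)
```

The single phase accumulator is what makes the signal continuous-phase: only the phase increment changes at a symbol boundary, which keeps the spectrum narrower than switching between independent oscillators.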
Resilient to noise up to a certain level:
I've implemented a BCH error-correcting code able to correct up to 8 errors per data block. BCH codes give their biggest coding gains when 25 to 50% redundancy is added.
Preferably 16 kHz - 20 kHz carrier wave with 44.1 kHz sampling:
I would suggest increasing the sampling frequency to 48 kHz (which is quite common in smartphones nowadays) and limiting your operating band to between 17.5 (or 18) kHz and 21.5 kHz. If you use 44.1 kHz, then you have to work between 17.5 (or 18) kHz and 20.5 kHz. But you have to be very careful with the speaker and microphone you select, as not all of them work at these high frequencies. You have to measure their frequency response. If you're using a PC, I would recommend ARTA or Audacity, or, if you are a programmer, Matlab or Octave. If you are using a smartphone, I would recommend any audio spectrum analysis app.
Not too complex coding logic:
I would recommend the noncoherent quadrature correlation FSK demodulator. It is much lighter than any FFT-based implementation, especially if you are less than 1 m away, where Doppler and multipath don't affect you so much.
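For reference, a minimal sketch of that kind of demodulator: each symbol block is correlated against a cosine and a sine at every candidate tone, the two correlations are squared and summed (which removes the dependence on the unknown carrier phase), and the tone with the largest energy wins. The sampling rate, symbol length, and tone frequencies are assumptions for illustration, not the values from my implementation, and the sketch assumes symbol timing is already known. `make_tone` is a hypothetical helper used only to produce test input.

```python
import math

FS = 48000                  # sampling rate in Hz (assumed)
N = 720                     # samples per symbol (assumed: 3 bits at 200 bps)
TONES = [18000.0 + 200.0 * k for k in range(8)]   # assumed tone plan

def tone_energy(block, freq):
    """Quadrature correlation: I^2 + Q^2 is independent of the carrier phase."""
    w = 2.0 * math.pi * freq / FS
    i_sum = 0.0
    q_sum = 0.0
    for n, x in enumerate(block):
        i_sum += x * math.cos(w * n)
        q_sum += x * math.sin(w * n)
    return i_sum * i_sum + q_sum * q_sum

def demodulate(samples):
    """Pick, for each symbol-aligned block, the tone with the largest energy."""
    symbols = []
    for start in range(0, len(samples) - N + 1, N):
        block = samples[start:start + N]
        energies = [tone_energy(block, f) for f in TONES]
        symbols.append(energies.index(max(energies)))
    return symbols

def make_tone(symbol, phase):
    """Hypothetical test helper: one symbol of a tone with an arbitrary phase."""
    w = 2.0 * math.pi * TONES[symbol] / FS
    return [math.cos(w * n + phase) for n in range(N)]

sent = [0, 5, 7, 2]
received = []
for k, s in enumerate(sent):
    received += make_tone(s, phase=0.9 * k + 0.3)   # unknown phase per symbol
decoded = demodulate(received)
```

Each symbol costs two multiply-accumulates per sample per tone, and the cosine and sine references can be precomputed into tables once, since they are the same for every block.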