My Musical Experiments with Artificial Intelligence – Part 4

Improvisation on tabla strikes a delicate balance between two things –

Artistic improvisation within the rhythmic cycle itself. For example, within Teentaal, the tabla player is free to play variations as long as they sound aesthetically pleasing.
Taking influences, mostly rhythmic, from the lead musician and reproducing them on the tabla.

A good tabla player will strike the right aesthetic balance between these two. Tradition and the type of music (vocal or instrumental, and the sub-genre, also called Gharana) dictate this balance. A balance tilted too far towards taking influences makes the performance sound too “meddling” and interferes with the overall effect of the concert.

One can therefore imagine two neural networks working concurrently, one on each of these tracks, with their outputs combined through a mix gate whose bias can be set.
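One simple way to picture such a mix gate is a weighted blend of the two networks' next-syllable probabilities. The sketch below is only an illustration under that assumption; the function name mix_gate and the convex-combination rule are hypothetical and are not part of the demo code.

```python
def mix_gate(p_improv, p_influence, bias):
    """Blend two next-syllable probability lists.

    bias = 0.0 uses only the improvisation network,
    bias = 1.0 uses only the influence network.
    (Hypothetical helper, not taken from the demo notebook.)
    """
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must lie between 0 and 1")
    if len(p_improv) != len(p_influence):
        raise ValueError("both distributions must cover the same syllables")
    mixed = [(1.0 - bias) * a + bias * b for a, b in zip(p_improv, p_influence)]
    total = sum(mixed)
    # Renormalise so the result is still a probability distribution.
    return [m / total for m in mixed]


blended = mix_gate([0.7, 0.2, 0.1], [0.1, 0.1, 0.8], bias=0.25)
```

A low bias keeps the output close to the improvisation network, which matches the concern above about accompaniment that sounds too "meddling".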

For the example demonstrated below, we choose just one of the neural networks – the one accomplishing only improvisation within the cycle.

The architecture of such a neural network involves solving three problems –

Generating the training dataset
Instantiating and training a recurrent neural network
Using the trained neural network to produce improvisation

Generating Training Dataset

We could make this entire system think in terms of sounds. In that case, we would need really high quality recordings of a large number of concerts. We could time-slice such recordings into small mp3 segments to generate the training dataset.

Perhaps a better and simpler approach would be to use MIDI sounds instead of real sounds. The advantage of this approach is that it insulates us from the recording quality of the input dataset. There are a lot of excellent concerts available on sources such as YouTube. These concerts have excellent performances but not necessarily great sound quality. If we pre-process such recordings to translate them into MIDI sounds, the subsequent flow no longer depends on sound quality. We could also use well-sampled MIDI sounds to generate high quality output once the neural network is fully trained.

An interesting question comes up around royalties for the use of these concert recordings from sources such as YouTube. At first, one may think that royalties would be owed to the artists. However, there are two counterpoints to this. Firstly, if we slice a performance and translate it to MIDI sounds, one can argue that we have made substantial changes to the original and that the new material is therefore an original production.

More importantly, this raises another interesting legal question. We could argue that the learning of the neural network is substantially similar to human learning. No single concert really trains the network and, as you will see, the output produced by the network is not identical to any of the concerts. The network learns from several concerts, just like humans do, and therefore one could argue that no royalties are due, just as you cannot expect a real tabla player to pay royalties for listening to concerts and learning from them over a period of time.

As a musician, I do feel strongly about royalties. However, as you can see from the arguments above, I do think this issue is both legally and ethically unresolved.

Generating the training dataset from a series of concerts also gives rise to further interesting possibilities. For example, one can choose to sample only concerts performed by one artist (say Zakir Hussain). In such a case, the network would learn the style of only this artist and presumably would produce output that stylistically sounds like this artist. Or one can sample from concerts of multiple artists. In this case, the network will develop its own musical personality. The possibilities here are very interesting musically.

In the demo below, the training dataset has been generated using Excel lookups. The logic sticks to the rules of the Teentaal rhythm: for each beat it looks up the possible sounds and randomly mixes them together. Each cycle produced in the training data is a legitimate Teentaal rhythm.
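The same lookup-and-randomize idea can be sketched in Python. This is not the Excel workbook used for the demo: the base pattern is the commonly taught Teentaal theka, while the VARIANTS table and the function make_cycle are hypothetical placeholders that have not been musically vetted.

```python
import random

# Commonly taught Teentaal theka; the first beat is upper case, as in the post.
THEKA = ["DHA", "Dhin", "Dhin", "Dha",
         "Dha", "Dhin", "Dhin", "Dha",
         "Dha", "Tin", "Tin", "Ta",
         "Ta", "Dhin", "Dhin", "Dha"]

# Hypothetical lookup of substitutions allowed for each base syllable.
# Syllables without an entry (such as the first beat) are kept unchanged.
VARIANTS = {
    "Dha": ["Dha", "DhaGe"],
    "Dhin": ["Dhin", "DhinNa"],
    "Tin": ["Tin", "TinNa"],
    "Ta": ["Ta", "TiTa"],
}


def make_cycle(rng):
    """Return one 16-beat cycle by picking a random variant for each beat."""
    return [rng.choice(VARIANTS.get(syllable, [syllable])) for syllable in THEKA]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_cycles = [make_cycle(rng) for _ in range(5)]
for cycle in training_cycles:
    print(" ".join(cycle))
```

Because every beat is drawn from a list tied to its position in the theka, each generated record has 16 beats and keeps the emphasized first beat.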

Instantiating and Training Recurrent Neural Network

You can see the code to do this here – https://github.com/PrasadBhandarkar/DeepTabla-Project/blob/master/Deep%20Tabla.ipynb

Here are the main ideas –

The training dataset in this case is a text representation of the sounds (e.g. Dha, Dhin etc.). The first beat of each cycle is written in upper case (e.g. DHA). There are 1.68 million records in the training dataset, each record comprising 16 words that correspond to the 16 beats in the cycle.

We treat this dataset as a stream of data in time, i.e. we assume that all the individual records are serially concatenated, creating a stream of 1.68 million x 16 beats.

We then slice this stream into inputs and outputs. The length of each slice is determined by the variable sequence_length. It is set to 16 in the example, but changing it to an arbitrary number such as 10, 13 or 17 does not make any difference. This variable tells the program to slice the data so that beats 1–16 (or however many beats sequence_length specifies) form an input whose expected output is beats 2–17. Similarly, beats 2–17 produce beats 3–18 as output. Iterating through the entire stream in this way produces almost 1.68 million input-output combinations.
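The slicing described here can be sketched as a sliding window in which each target is the input shifted by one beat. The function name make_windows is an assumption for illustration, and the notebook may step through the stream differently (for example, in non-overlapping slices).

```python
def make_windows(stream, sequence_length):
    """Pair each window of beats with the same window shifted by one beat."""
    inputs, targets = [], []
    for start in range(len(stream) - sequence_length):
        inputs.append(stream[start:start + sequence_length])
        targets.append(stream[start + 1:start + sequence_length + 1])
    return inputs, targets


# Beats numbered 1..18 stand in for syllables to make the shift visible.
beats = list(range(1, 19))
inputs, targets = make_windows(beats, sequence_length=16)
print(inputs[0][0], inputs[0][-1], "->", targets[0][0], targets[0][-1])
print(inputs[1][0], inputs[1][-1], "->", targets[1][0], targets[1][-1])
```

With 18 beats and a sequence_length of 16 this yields exactly two pairs: beats 1–16 mapped to 2–17, and beats 2–17 mapped to 3–18.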

The next step is to translate our data into numbers. Remember that neural networks are mathematical models that think in terms of tensors. We therefore iterate through the entire dataset and build a dictionary of unique rhythmic syllables. In this case there are 31, but only because the training data is generated in a simplistic manner. In reality this number could be much bigger, but that does not change the basic logic.

Once we build the dictionary where each syllable gets a unique index, we then translate the training data in terms of numbers.
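A minimal sketch of this dictionary step is shown below. The helper names build_vocab and encode are hypothetical, and sorting the syllables is just one convenient way to make the indices repeatable.

```python
def build_vocab(stream):
    """Assign every distinct syllable a unique integer index."""
    syllables = sorted(set(stream))
    syllable_to_index = {s: i for i, s in enumerate(syllables)}
    index_to_syllable = {i: s for s, i in syllable_to_index.items()}
    return syllable_to_index, index_to_syllable


def encode(stream, syllable_to_index):
    """Translate a stream of syllables into a stream of indices."""
    return [syllable_to_index[s] for s in stream]


stream = ["DHA", "Dhin", "Dhin", "Dha"]
to_index, to_syllable = build_vocab(stream)
encoded = encode(stream, to_index)
decoded = [to_syllable[i] for i in encoded]
```

Keeping the reverse mapping around matters later, because the network's numeric output has to be turned back into syllables.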

The sequence_length variable also determines the number of input neurons. Each input neuron takes a syllable index as its value, which runs from 0 to 30 when the 31 syllables in the dictionary are indexed from zero. The output layer therefore needs 31 neurons, each taking a value between 0 and 1 that indicates the probability that the next syllable is the one at that index.
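Outputs that lie between 0 and 1 and behave like probabilities are typically obtained by passing raw scores through a softmax. The sketch below assumes that is what the output layer does, which the post does not state; the scores are made-up numbers.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab_size = 31                      # number of syllables in the dictionary
raw_scores = [0.0] * vocab_size      # made-up scores, one per output neuron
raw_scores[4] = 2.0                  # pretend the network favours index 4
probabilities = softmax(raw_scores)
most_likely_index = probabilities.index(max(probabilities))
```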

The number of neurons in the hidden layer is set by a variable as well. There is no deterministic answer for what this number should be. The generally accepted practice is to make it bigger than the number of input and output neurons. I played around with this number, and beyond a point increasing it does not seem to improve the quality of the output. We set it to 128.

You can follow the rest of the code in the documentation. The network is trained in batches for computing efficiency. There is also a step in the code that randomly shuffles the input-output combinations, once the training data has been split into them, before feeding them to the network.
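Shuffling followed by batching can be sketched as follows. The generator shuffled_batches and its batch size are illustrative assumptions rather than the notebook's own code.

```python
import random


def shuffled_batches(inputs, targets, batch_size, seed=None):
    """Shuffle input-output pairs together, then yield them in batches."""
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)  # keeps each input with its own target
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        yield [x for x, _ in chunk], [y for _, y in chunk]


# Toy data: each target is simply its input plus one.
toy_inputs = list(range(10))
toy_targets = [x + 1 for x in toy_inputs]
batches = list(shuffled_batches(toy_inputs, toy_targets, batch_size=4, seed=0))
```

The pairs are zipped before shuffling so that an input can never be separated from its expected output.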

As you can see, the loss function of the network, i.e. the gap between the output predicted by the network and the actual output in the training data, decreases exponentially as the network is trained.

In fact, even though we have a lot more data at our disposal to train the network, you can see that after 300 or so iterations the improvement in the network’s ability to predict results is so marginal that we do not really have to go all the way.

Using Trained Neural Network to Produce Improvisation

Once the network is trained, we can ask it to produce its own improvisations. This is done by priming it with at least one cycle. You can see the results. The code also validates the quality of the results produced. Specifically, the results produced in this case are all original improvisations.
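The priming-and-sampling loop can be sketched without committing to any particular framework. Here predict_next is a hypothetical stand-in for the trained network: it takes the most recent beats and returns a dictionary of syllable probabilities. The stub used below is not a real model.

```python
import random


def generate(predict_next, primer, n_beats, sequence_length=16, seed=None):
    """Extend a primer by sampling one syllable at a time."""
    rng = random.Random(seed)
    beats = list(primer)
    for _ in range(n_beats):
        context = beats[-sequence_length:]      # most recent beats only
        probs = predict_next(context)           # hypothetical model call
        syllables = list(probs)
        weights = [probs[s] for s in syllables]
        beats.append(rng.choices(syllables, weights=weights, k=1)[0])
    return beats[len(primer):]                  # return only the new beats


def stub_model(context):
    """Placeholder for a trained network: repeats the beat 16 steps back."""
    return {context[0]: 1.0}


primer = ["DHA", "Dhin", "Dhin", "Dha"] * 4     # one 16-beat primer cycle
new_beats = generate(stub_model, primer, n_beats=16, seed=0)
```

Sampling from the probabilities, rather than always taking the most likely syllable, is one common way to keep the output varied.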

It is possible that the network could produce a few cycles identical to ones in the training data just by chance, and that would be ok. However, if it copies too many cycles from the training dataset, then chances are that the network is undertrained. This is much like a child learning tabla who copies improvisations from the masters before learning to produce their own.
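One way to quantify copying is to measure what fraction of generated cycles also appear verbatim in the training data. The helper copied_fraction below is a hypothetical sketch of such a check, not the validation code from the notebook.

```python
def copied_fraction(generated_cycles, training_cycles):
    """Return the share of generated cycles found verbatim in the training set."""
    if not generated_cycles:
        return 0.0
    seen = {tuple(cycle) for cycle in training_cycles}  # set for fast lookup
    copies = sum(1 for cycle in generated_cycles if tuple(cycle) in seen)
    return copies / len(generated_cycles)


training = [["DHA", "Dhin"], ["DHA", "Dha"]]
generated = [["DHA", "Dhin"], ["DHA", "Tin"], ["DHA", "Ta"], ["DHA", "Ge"]]
fraction = copied_fraction(generated, training)
```

What counts as "too many" copies is a judgment call that depends on how large and varied the training data is.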

The results also adhere to all the main rules of the rhythmic cycle, i.e. the first beat is the emphasized beat, the length of the cycle is 16 beats, and beats 9, 10, 11 and 12 follow a certain pattern (a.k.a. “khali”).
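These three rules can be checked mechanically. In the sketch below, the upper-case test relies on the post's convention of writing the first beat in capitals, while the set of syllables accepted in the khali section is a hypothetical parameter, since the exact pattern is not spelled out here.

```python
def follows_teentaal_rules(cycle, khali_allowed):
    """Check cycle length, the emphasized first beat and the khali section."""
    if len(cycle) != 16:
        return False
    if not cycle[0].isupper():                  # first beat must be emphasized
        return False
    if any(s.isupper() for s in cycle[1:]):     # no other beat is emphasized
        return False
    # Beats 9-12 are list positions 8-11.
    return all(s in khali_allowed for s in cycle[8:12])


theka = ["DHA", "Dhin", "Dhin", "Dha",
         "Dha", "Dhin", "Dhin", "Dha",
         "Dha", "Tin", "Tin", "Ta",
         "Ta", "Dhin", "Dhin", "Dha"]
khali_allowed = {"Dha", "Tin", "Ta"}            # hypothetical khali syllables
is_valid = follows_teentaal_rules(theka, khali_allowed)
```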

What is fascinating about this is that nowhere in the code have we programmed in any of these rules. The rules are adhered to simply because the training data follows them. In other words, the model would work just as well if the training dataset had a 7, 9 or 10 beat rhythmic cycle instead of 16 beats.

In a real application, the last step would simply be to map MIDI sounds to these syllables and, well, enjoy the accompaniment.

Conclusion

While these results were expected based on the theory of neural networks I had learnt, I found it astonishing and fascinating that this worked so well.

There are still some challenges to be solved though. The primary challenge is that while one can train a neural network on massive data in a cloud server environment, an application such as this will have to produce improvisation on a local device for it to be practically useful. For example, one could imagine a client app on one’s smartphone. The question though is, would smartphones have enough compute to produce these results seamlessly? Both Apple and Google have been working on AI optimized processors for iOS and Android devices.

To conclude, the AI age is here already. It is going to evolve fast. It will enrich our lives in ways we never imagined, but it will also raise a lot of questions, not just technical ones but also policy and ethical ones, as briefly discussed even in this small example. While we can expect rapid progress on the resolution of technical questions, I am not so sure about our ability as a society to resolve policy and ethical questions at the same rate. The question around AI then is really not whether it will evolve fast enough, but rather whether we will be ready for it fast enough.