Scheduling Deep Learning models made easy
The project is still under development. You will notice significant changes in the coming days.
This project is my graduate thesis project. I call it "schedulearn" because it is a system that helps you schedule deep learning models. The main objective of this project is to help you schedule deep learning models with the least amount of effort, so that users can focus on delivering quality models rather than worrying about resource management.
In this post I will be explaining how this system works in the simplest way possible.
In the figure above, you can see that there are three main components in the system: the API, the servers, and the user interface. The API endpoints handle users' requests from the user interface, such as creating, updating, and deleting models. The database stores the models' metadata. The scheduler, which is built in the same place as the API, is the core of the system and schedules the models depending on the resources available in the cluster. In other words, the system decides where each model should be trained and how many resources each model should use.
Three scheduling algorithms exist within the system:
- First-In-First-Out (FIFO)
- Round-Robin (RR)
- Elastic First-In-First-Out (EFIFO)
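To make the first two policies more concrete, here is a minimal sketch of how they could look. It is a simplified illustration, not the code used in schedulearn: the `Job` class and the function names are hypothetical, and round-robin is interpreted here as handing jobs to servers in rotation, which is an assumption. EFIFO is left out because its rules are not described in this post.

```python
from collections import deque
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Job:
    name: str
    gpus: int  # number of GPUs the job asks for (hypothetical field)


def fifo_schedule(queue, free_gpus):
    """Start jobs strictly in arrival order; stop at the first job that does not fit."""
    started = []
    while queue and queue[0].gpus <= free_gpus:
        job = queue.popleft()
        free_gpus -= job.gpus
        started.append(job.name)
    return started, free_gpus


def round_robin_assign(jobs, servers):
    """Hand each job to the next server in rotation (assumed meaning of RR)."""
    rotation = cycle(servers)
    return [(job.name, next(rotation)) for job in jobs]


jobs = [Job("a", 2), Job("b", 4), Job("c", 1)]
started, left = fifo_schedule(deque(jobs), free_gpus=4)
placement = round_robin_assign(jobs, ["server-1", "server-2"])
```

In this sketch, job `b` does not fit after `a` has started, so `c` has to wait behind it even though it would fit. That is the usual trade-off of a strict FIFO queue.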
When a job is submitted, it goes through several steps before its results are sent back to the user:
- The job is sent to the API.
- The API saves the job's metadata in the database, schedules the job, and sends the job to its corresponding server.
- The server then pulls the job from the API and starts training the model.
- The server sends the result back to the API, which saves the result in the database.
- The API sends the result back to the user.
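The steps above can be sketched as a small in-memory simulation. Everything here is a stand-in: the `Api` and `Server` classes, the dictionary used as a database, and the "always pick the first server" policy are assumptions made for illustration, and the real system runs training as a long-running job rather than a blocking call.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str

    def train(self, job):
        # Placeholder for the real training run on this server's GPUs.
        return f"{job['model']} trained on {self.name}"


@dataclass
class Api:
    servers: list
    database: dict = field(default_factory=dict)  # job id -> metadata

    def schedule(self, job):
        # Stand-in policy: always pick the first server.
        return self.servers[0]

    def submit(self, job_id, model):
        # Step 1: the job arrives at the API.
        job = {"model": model, "status": "submitted"}
        # Step 2: save the metadata, schedule the job, and dispatch it.
        self.database[job_id] = job
        server = self.schedule(job)
        job["server"] = server.name
        job["status"] = "running"
        # Step 3: the server trains the model.
        result = server.train(job)
        # Step 4: the result is stored in the database.
        job["result"] = result
        job["status"] = "finished"
        # Step 5: the result goes back to the user.
        return result


api = Api(servers=[Server("server-1")])
result = api.submit(job_id=1, model="my-model")
```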
The system consists of three servers. The following are the specifications of the entire system:
- 3 x 4 Nvidia GTX 1080 Ti graphics processing units (GPUs), each equipped with 11 GB of video random access memory (VRAM)
- Intel Xeon E5-2678 v3 with 48 cores running at 2.50 GHz
- 128 GB of random access memory (RAM)
- 10G PCIe network interface card (NIC)
Due to security measures imposed by the lab that I worked for, I am unable to train models with multiple GPUs from different servers. That said, assigning more GPUs does not translate into a one-to-one performance increase: training a model with 4 GPUs does not mean that the model will be trained 4 times faster than with 1 GPU.
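One common way to quantify this is to compare the measured speedup with the number of GPUs used. The sketch below uses made-up timings purely for illustration; they are not measurements from this system, and the function names are hypothetical.

```python
def speedup(time_one_gpu, time_n_gpus):
    """How many times faster the multi-GPU run is compared to one GPU."""
    return time_one_gpu / time_n_gpus


def scaling_efficiency(time_one_gpu, time_n_gpus, n_gpus):
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup(time_one_gpu, time_n_gpus) / n_gpus


# Hypothetical timings: 100 minutes on 1 GPU, 40 minutes on 4 GPUs.
s = speedup(100, 40)                 # 2.5x, not 4x
e = scaling_efficiency(100, 40, 4)   # 0.625, i.e. 62.5% of linear scaling
```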
To work around this restriction, each job is only assigned GPUs from the same server. Not only is this the only workaround, it is also one of the ways to reduce communication costs. Using multiple GPUs means the data has to travel more, which already imposes a significant overhead, and the further the data travels, the more overhead it imposes. Therefore, it's better to assign multiple GPUs from the same server to a given job.
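A minimal sketch of this same-server rule, assuming three servers with four GPUs each as in the specifications above. The server names, the `place_on_single_server` function, and the "first server that fits" choice are hypothetical and only meant to show the constraint, not the actual allocation logic.

```python
def place_on_single_server(free_gpus, requested):
    """Return (server, gpu_ids) taken from one server, or None if no single server fits."""
    for server, gpu_ids in free_gpus.items():
        if len(gpu_ids) >= requested:
            chosen = gpu_ids[:requested]
            free_gpus[server] = gpu_ids[requested:]  # mark the chosen GPUs as busy
            return server, chosen
    # Never split a job across servers, even if enough GPUs are free in total.
    return None


# Free GPU ids per server (3 servers x 4 GPUs).
cluster = {
    "server-1": [0, 1, 2, 3],
    "server-2": [0, 1, 2, 3],
    "server-3": [0, 1, 2, 3],
}

first = place_on_single_server(cluster, 3)    # fits on server-1
second = place_on_single_server(cluster, 2)   # server-1 has 1 GPU left, so server-2
too_big = place_on_single_server(cluster, 5)  # no single server has 5 GPUs
```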