Machine Learning on Akash

Training AI on Akash Network, The First Decentralized Cloud

Tee Yee Yang
8 min read · Jul 2, 2021

Artificial Intelligence — A Glutton for Computer Power

Will artificial intelligence take over the world? Sometimes it seems that way: everyone has heard of AI models that can compose entire essays, generate realistic faces of people who don't exist, or create images from text descriptions.

But these AI feats don't come cheap: the model must be "trained" over countless iterations, and that training eats up enormous amounts of compute on mammoth GPUs. XLNet from Google, for example, can cost around $61,000 to train each time, with no guarantee of good results.

Not all AI models are as complicated as the Google one cited above. Others cost less to train. But cost still erects a large barrier for anyone wanting to create AI models. Most hobbyists, practitioners, and engineers at small firms simply cannot go out and buy a powerful GPU. So they naturally turn to using cloud services like AWS, Microsoft Azure, and Google Cloud to train their AI models. Despite the convenience and relatively low initial costs of using big cloud services, these AI developers will inevitably encounter a multitude of problems.

These people do not exist!

Clunky Setup

Setting up and using a cloud account can get very complicated. The UI (User Interface) is clunky and difficult to use. Getting a GPU up and running on AWS requires more than 20 steps, and those steps don't even include account setup, billing, and ongoing management.

Cost Builds Up Over Time

As time goes by, the costs of running an AI deployment on AWS or another large cloud service can escalate. Incumbent cloud providers advertise cheap setup costs and attract developers, startups, and firms to their platforms with free credits. As we all know, machine learning and AI are highly iterative processes, and they typically require an enormous amount of compute resources. Those initial discounts don't last forever, so it is not uncommon to hear of AI developers breaking the bank on their AWS bill.

Privacy and Censorship

Deploy a website or other operation on AWS or other big cloud service, and it might just disappear overnight. That’s what happened to Parler, a conservative version of Twitter. Turns out that Parler was used by many of the January 6, 2021, insurrectionists who occupied the U.S. Capitol. AWS — where Parler was deployed — promptly shut Parler’s website down. Apple and Google followed suit and disabled Parler’s ability to sell its app on the App Store and Google Play.

As pointed out by Time.com:

Amazon[‘s] kicking a company off AWS, however, can be a death punch. AWS isn’t an app store, it’s a cloud computing service. In ye olden times, companies that wanted to do much of anything having to do with the Internet generally had to run their own servers, a complicated, costly and time-consuming enterprise mostly reserved for the largest firms. Then came cloud providers like AWS, which rent servers (and offer myriad other services) on demand — you or I could go over to AWS and have something running on AWS servers in minutes.

Time continues and points out that the convenience of running an operation on the cloud comes with a big cost:

But that convenience came at a cost: modern Internet services are increasingly built on AWS and its rivals, like Microsoft Azure and IBM Cloud. That has given those firms tremendous sway over what conduct is and is not acceptable on the Internet — in terms of free speech, they have become even more powerful than, say, Apple. It’s one thing to stop offering an app, it’s another to destabilize or block another company’s entire online operation.

Now There’s Another Way to Get Cloud Computing

Akash Network is the first decentralized and permissionless cloud-computing service. Quite simply, Akash is a marketplace: it lets those with excess compute power lease it to those who need it. Anyone needing compute can establish an Akash account and post a deployment order; those offering compute bid on that order, and a lease is struck with the winning bid.

The party providing compute power on the Akash Network is called a Provider. The entity leasing that power is called a Tenant. Thus, one can think of the Akash Network as the Airbnb of cloud computing.

Establishing an Akash Account

One who wishes to become a Tenant simply starts at the Akash website, which provides clear instructions on the steps a Tenant takes to deploy an operation on the Akash Network.

The Tenant will pay for compute power by purchasing a supply of the cryptocurrency called AKT. The Tenant can then “stake” that AKT with one of the 75 Validators on the Akash Network. By staking AKT, the Tenant will earn 55% APR in additional AKT coins. And if the Tenant re-stakes the rewards each day, the Tenant’s return jumps to 70%. In other words, if a Tenant buys 1000 AKT coins, stakes them, and compounds them by re-staking daily rewards, after one year the Tenant will have 1700 AKT coins.
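As a sanity check on those numbers, daily compounding at 55% APR works out to roughly a 70% effective annual yield, in line with the figures above (a back-of-the-envelope sketch; actual staking rewards vary with network conditions):

```python
# Back-of-the-envelope compounding: 55% APR with rewards re-staked daily.
apr = 0.55
daily_rate = apr / 365

balance = 1000.0  # starting AKT
for _ in range(365):
    balance += balance * daily_rate

print(round(balance))  # roughly 1733 AKT after one year, i.e. ~70%+ effective yield
```

The simple (non-compounded) return would be 1550 AKT; daily re-staking is what pushes the effective rate from 55% toward 70%.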

The point is this: By buying more AKT than needed to fund a deployment on Akash Network, the Tenant can recover some of the deployment costs through (1) staking rewards and (2) appreciation of the value of AKT (assuming that value increases).

Deploying a Deep Learning Training Job on the Akash Network

Setting up a deployment on Akash is very simple if you are already familiar with containers and Docker in general. The source code for this demo/guide is available at https://github.com/yeeyangtee/pytorch-docker-cpu. The main steps unfold like this:

1. Create a docker image with your training script. The docker image in this demo enables the 4 key operations: download data > run training > allow monitoring of training > output trained model weights.

2. Install Akash and fund your deployment wallet.

3. Deploy the image to an anonymous compute provider using Akash.

4. Check in on your training progress and download the model weights when done.

Docker Image with PyTorch


We are using the PyTorch framework, one of the most popular open source deep learning frameworks. The training is based on this tutorial on transfer learning. In this docker image, three processes are run simultaneously.

  • Python: for doing the data handling, model training and logging of results
  • Tensorboard: A visualization toolkit that allows us to easily monitor deep learning training progression from any browser window
  • sshd: A hacky solution to access the deployment and grab the trained model files after training is completed. This part was based on this tutorial from the Akash community.

The corresponding Dockerfile can be found in the same repo for source code, and the built docker image can be accessed here.
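A minimal Dockerfile along these lines might look like the sketch below (the file names and entrypoint script are illustrative; the actual Dockerfile is in the linked repo):

```dockerfile
FROM pytorch/pytorch:latest

# sshd for post-training file access
RUN apt-get update && apt-get install -y openssh-server && mkdir -p /run/sshd

WORKDIR /app
COPY train.py entrypoint.sh ./
RUN pip install tensorboard && chmod +x entrypoint.sh

# Tensorboard (6006) and sshd (22)
EXPOSE 6006 22

# entrypoint.sh starts sshd and Tensorboard in the background,
# then runs train.py in the foreground
CMD ["./entrypoint.sh"]
```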

Installing Akash and Funding the Deployment Wallet

There are scores of informative resources that show how to install Akash and fund the deployment wallet. I will be showing only the key steps required for Akash setup and deployment.

The Precise Steps to Take to Deploy an AI Operation on Akash

1. Install Akash software: download binaries from the release page and install the akash binary into your system path.

2. Create a wallet with the ‘keys add’ command. You can use any name you want, in this example it is ‘mlonakash’. Next, we set the account address as a shell variable to be used later.

akash keys add mlonakash
export AKASH_ACCOUNT_ADDRESS="$(akash keys show mlonakash -a)"

3. Fund your account. Purchase AKT from any of the exchanges listed at https://akash.network/token. Send AKT to your account address created above.

4. Connect to Akash Mainnet. Use the following commands to set up the required environment variables.

AKASH_NET="https://raw.githubusercontent.com/ovrclk/net/master/mainnet"
AKASH_VERSION="$(curl -s "$AKASH_NET/version.txt")"
export AKASH_CHAIN_ID="$(curl -s "$AKASH_NET/chain-id.txt")"
export AKASH_NODE="$(curl -s "$AKASH_NET/rpc-nodes.txt" | head -1)"

Deploying your Docker image

I set up the deploy.yml according to Akash's SDL. This is similar to a Docker Compose file. I also arbitrarily set the resource requirements to CPUs = 8, Memory = 16GB, Storage = 32GB. You shouldn't need that many CPUs, but it's pretty cheap anyway. You need to follow the steps in the Akash community tutorial linked above to set the pubkey environment variable in the SDL file. This ensures that we can access the deployment after training.
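For reference, the relevant pieces of such an SDL file might look roughly like this (a sketch, not the exact deploy.yml from the repo; the image name and pubkey value are illustrative placeholders):

```yaml
version: "2.0"
services:
  train:
    image: yeeyangtee/pytorch-docker-cpu   # illustrative image name
    env:
      - SSH_PUBKEY=ssh-rsa AAAA...         # your public key, per the Akash guide
    expose:
      - port: 6006    # Tensorboard
        to:
          - global: true
      - port: 22      # sshd
        to:
          - global: true
profiles:
  compute:
    train:
      resources:
        cpu:
          units: 8
        memory:
          size: 16Gi
        storage:
          size: 32Gi
```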

Next, I just used the Akash deploy tool to send in the deployment. This assumes that you have all the important environment/shell variables set already.

# Ensure these shell variables are set:
#   $AKASH_NODE, $AKASH_CHAIN_ID, $AKASH_NET,
#   $AKASH_KEY_NAME, $AKASH_ACCOUNT_ADDRESS
akash deploy create deploy.yml --fees 5000uakt

The deploy tool usually works quite smoothly, automatically creating the order, waiting for bids and selecting the lowest bidding provider. I use this because I’m pretty lazy, but the Akash deploy docs give a detailed step-by-step method to do this! Make sure to set the variables $AKASH_DSEQ, $AKASH_OSEQ, $AKASH_GSEQ and $AKASH_PROVIDER. Next, we send the manifest to the provider:

# Ensure these shell variables are set:
#   $AKASH_DSEQ, $AKASH_OSEQ, $AKASH_GSEQ, $AKASH_PROVIDER
akash provider send-manifest deploy.yml

This should not give any output if everything goes well. Now, it’s time to get down to business. We need to access the lease status so that we can find out the hostnames and exposed ports. The command to do that is:

akash provider lease-status

If you have done everything successfully, the lease status will list the services and their forwarded ports. In this example, the external port is 30976 for SSH and 32501 for Tensorboard.

Check on training and download trained model

Since we are running Tensorboard on port 6006 on our deployment, we can simply access it by entering hostname:externalPort into our browser. In the example below, you can see the training status being recorded in real-time as the training is progressing on the decentralized cloud!

Tensorboard, powered by Akash, for monitoring deep learning. It's awesome.

Last of all, once the deep learning training has been completed, we can access the deployment through SSH. RANDOMPORT below is 30976 for this example; it will be different in a separate deployment. We can then copy the trained model files to our local machine, or upload them to an external storage server.

ssh root@hostname -p RANDOMPORT
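To pull the trained weights down in one shot, scp over the same port works as well (the remote path here is illustrative; use wherever your training script writes its output):

```shell
# Copy the trained weights from the deployment to the local machine
scp -P 30976 root@hostname:/app/model_weights.pth ./model_weights.pth
```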

Concluding Thoughts

This is still largely a proof of concept that it is simple and cost effective to run deep learning containers on Akash. Many components were patched together in a hacky way to make this work. There are still a few developments needed before I would use it for serious training jobs. But the Akash Team is very responsive and will surely provide:

  • GPU support
  • Native shell access to containers/deployments

So the day will come when large AI training takes place on the Akash Network. The Network is permissionless. It is decentralized. And, perhaps most importantly of all, it costs about one-third the amount charged by AWS and the big cloud oligarchy.


Tee Yee Yang

PhD Candidate at Nanyang Technological University, Singapore