HackWek - Hallucinating Solidity Source Code

Diligence Hackathon - HackWek

Buidling, breaking, hacking, making! 🥷⚔️ Testing boundaries and playing with experimental technology is what we love at Diligence.

In this spirit, “HackWek” was born: a recurring, Diligence-internal, five-day hacking party 🥳. In this episode, I set out to build a Solidity source code writing robot 😵‍💫🤖.

Hallucinating Solidity Source Code

Some time ago, I started collecting smart contract samples from public block explorers with the smart-contract-sanctuary project. Initially I had no particular purpose in mind, but the collection quickly turned into a treasure trove for all kinds of activities. So, what can we do with it?

Well, there’s one thing: 👉 Feed everything into a Recurrent Neural Network (RNN) and let it spit out Solidity-like source code.

The experiment

github.com/Hallucinate.sol contains everything you need to reproduce this experiment. It is based on the excellent TensorFlow text-generation tutorial and trains the text-prediction model on Solidity source code from the smart-contract-sanctuary. The goal was to create something like scigen (an academic paper generator) for Solidity, which could be used to test the robustness of source-based tools. The model predicts text character by character, which is pretty sub-optimal because contextual information (structure, grammar, tokenized input) would be readily available from parsing the input. Still, I decided to leave that for the next hackathon! 🙌
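
To make "character by character" concrete, here is a minimal sketch of the kind of preprocessing a character-level model needs: every distinct character becomes an integer id, and each training example pairs a sequence with the same sequence shifted by one character. It uses plain Python rather than TensorFlow, and all names (`encode`, `make_examples`, the sample contract) are illustrative assumptions, not code from the repository.

```python
# Illustrative sketch only: character-level vocabulary and training pairs.
source = "contract Token { uint256 public totalSupply; }"

# Vocabulary: every distinct character gets an integer id.
vocab = sorted(set(source))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}


def encode(text):
    """Turn text into a list of character ids."""
    return [char_to_id[ch] for ch in text]


def decode(ids):
    """Turn a list of character ids back into text."""
    return "".join(id_to_char[i] for i in ids)


def split_input_target(ids):
    """Input drops the last id; target is the same sequence shifted by one."""
    return ids[:-1], ids[1:]


def make_examples(ids, seq_length):
    """Cut ids into non-overlapping chunks of seq_length + 1 and split each."""
    chunk = seq_length + 1
    return [
        split_input_target(ids[i:i + chunk])
        for i in range(0, len(ids) - chunk + 1, chunk)
    ]


ids = encode(source)
examples = make_examples(ids, seq_length=10)
```

For each pair, the model sees the input characters and is trained to predict the target, which is simply the next character at every position. Nothing here knows about Solidity grammar, which is exactly the limitation noted above.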

Generate Solidity Code

If you just want to generate random Solidity-like text, you can use my pre-trained model:

  1. Go to 👉 Tutorial 2 - load & hallucinate
  2. Copy the Python notebook to your Google Drive.
  3. Open it in Google Colab.
  4. Run it.

This will load a pre-trained Solidity model into TensorFlow. After that, you can ask the model to “predict” the next x=3000 characters of Solidity code.
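
Asking a model to "predict" the next characters boils down to a loop: sample one character, append it, and feed the result back in. The sketch below shows that loop with a deliberately tiny stand-in model (bigram character counts) instead of the pre-trained RNN, so it runs without TensorFlow. The function names, the `temperature` parameter, and the toy corpus are assumptions for illustration and do not mirror the notebook's API.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Stand-in 'model': counts of which character follows which."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def sample_next(counts, current, rng, temperature=1.0):
    """Sample the next character; lower temperature is more conservative."""
    options = counts.get(current)
    if not options:  # dead end: fall back to any character seen in training
        return rng.choice(sorted(counts))
    chars = sorted(options)
    weights = [options[c] ** (1.0 / temperature) for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def hallucinate(counts, seed_text, num_chars, temperature=1.0, rng_seed=0):
    """Generate num_chars characters, one at a time, after seed_text."""
    rng = random.Random(rng_seed)
    out = list(seed_text)
    for _ in range(num_chars):
        out.append(sample_next(counts, out[-1], rng, temperature))
    return "".join(out)


corpus = "contract A { function f() public { } } contract B { uint x; }"
model = train_bigram(corpus)
text = hallucinate(model, "contract ", num_chars=3000)
```

An RNN replaces the bigram table with a learned state that summarizes everything generated so far, but the outer generation loop keeps the same shape.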

Train the model yourself

  1. Go to 👉 Tutorial 1 - train & hallucinate
  2. Copy the Python notebook to your own Google Drive.
  3. Open it in Google Colab.
  4. Run it.

Tweak the parameters! Improve the soliditygen.py module to use tokens instead of character prediction. Export it to TensorFlow.js and use it on your website.

Lessons Learned

This was a really fun project to work on! But, acknowledging the time constraints of a 5-day HackWek, here are a couple of thoughts on what can be improved:

  • The vocabulary should be based on token type & text instead of characters. E.g., use Pygments to lex Solidity and use the resulting tokens as the vocabulary. This should allow the model to learn the language’s structure faster than an inefficient character-level model.
  • Create a better and more reliable way of input normalization (reliably remove comments/pragmas/whitespace chars/etc.).
  • Customize the loss function to reinforce training towards fuzzy-parseable code.
  • Shuffle before downloading contract sources.
  • Continuous learning. Re-train with more sources (instead of only 15 MB 😂). Fix the continued training part.
  • The pre-trained solidity_model_text is pretty bad and will generate a lot of garbage. For example, 15 epochs of training are not enough, and the text-based shuffling approach makes no sense. But at least it is generating something! 😂
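
As a rough illustration of the first two points, the sketch below builds a vocabulary from (token type, text) pairs and drops comments and whitespace along the way. To stay dependency-free it uses a small hand-written regex lexer as a stand-in for Pygments; the token rules are hypothetical and far from a complete Solidity grammar, and pragma removal is not handled.

```python
import re

# Hypothetical, heavily simplified token rules; a real lexer such as Pygments
# covers far more of the language. At each position the first matching rule wins.
TOKEN_RULES = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("STRING", r'"(?:\\.|[^"\\])*"'),
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("PUNCT", r"[^\sA-Za-z0-9_$]"),
]
LEXER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES),
    re.DOTALL,
)


def tokenize(source):
    """Yield (token_type, text) pairs, dropping comments and whitespace."""
    for match in LEXER.finditer(source):
        kind = match.lastgroup
        if kind not in ("COMMENT", "WHITESPACE"):
            yield kind, match.group()


def build_vocab(tokens):
    """Map each distinct (type, text) token to an integer id."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


source = '''
// SPDX-License-Identifier: MIT
contract Token {
    /* total supply */
    uint256 public totalSupply = 1000;
}
'''
tokens = list(tokenize(source))
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
```

With this encoding the model predicts whole tokens such as `contract` or `{` rather than single letters, so it no longer has to learn how to spell keywords. The trade-off is a much larger vocabulary, since every distinct identifier becomes its own entry unless identifiers are normalized first.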
