<iframe title=GTM-TDJFNV8 cookieFlags="samesite=none;secure" src="https://www.googletagmanager.com/ns.html?id=GTM-TDJFNV8" height="0" width="0" style={{ display: 'none', visibility: 'hidden', }} />Reinforcement Learning | Consensys

Reinforcement Learning

This post was authored by Vanessa Bridge, Research Engineer, at PegaSys. Introduction In this post we describe the use ...
by ConsensysDecember 9, 2019
reinforcement vanessa

DISCLAIMER: In October 2020 PegaSys was renamed to Quorum.

This post was authored by Vanessa Bridge, Research Engineer, at PegaSys.


In this post we describe the use of reinforcement learning (RL) agents as a tool of signal detection of fairness, the idea that dishonest parties have no advantage over honest ones in distributed protocols. We assume the reader has basic blockchain knowledge and some familiarity with terms such as agent and environment in the context of RL. We focus specifically on training an agent that interacts with a simulated proof of work protocol (ETH POW) to test automated detection of dishonest mining strategies whose main goal is to increase miner revenue. The Ethereum consensus protocol was selected as the subject of this study due to the previous work done highlighting attacks which can be used to exploit the protocol’s design to increase rewards, as shown in the selfish mining attack[1], which we describe next. We note that given current hashing power distribution among mining pools and individual miners, as seen below, it is not feasible to engage in selfish mining. Furthermore, pools are not incentivized to engage in such behavior as it would undermine the trust people have in Ether and the MainNet, and provoke miners to switch pools.

Mining in Ethereum

In order to explain selfish mining we must first look at the ETH POW protocol and define what the expected behavior is,  i.e. “honest mining”. There is an understanding that the Ethereum protocol incentivizes miners to publish blocks as soon as they have solved the cryptographic puzzle for the nonce, as this enables them to claim the block reward and allows other miners to start mining on the new chain head. For the purpose of this post we refer to this as honest mining.

Majority Attack or 51% Attack

“The system is secure as long as honest nodes collectively control more CPU power than any cooperating group of attacker nodes.”[1] In Satoshi’s seminal paper, we see the introduction of an attack known as a majority attack, where a node or pool, possessing more than 50% of the total hashing power, could essentially modify past blocks thus increasing their revenue by reverting past payments they made or generating new coins for themselves.

Selfish Mining 

Miners have few restrictions on what their actions, and one could imagine attacks exploiting:

  1. Block propagation: a miner can decide when to share their mined block, no need to share immediately
  2. Timestamp validity:  the algorithm only checks that the timestamp is within current time + 15 seconds.  An attacker could manipulate the timestamp in their block to win in a fork.
  3. Latency: a miner could track latencies and use them in their favor. 

Thus a miner can choose to withhold their block in a private chain (fork) and build on it to gain an advantage over other miners and only release their “secret blocks” as they receive blocks from other miners at the height of their oldest “secret block”. Emin G. Sirer coined the term selfish mining to describe the previous behavior in Bitcoin[2]. In Ethereum uncle blocks play into this attack [3]  as they can include competing blocks as uncles and gain more weight on their fork making it the canonical one. 

In synthesis, the attack consists of holding blocks strategically, and releasing only when convenient (i.e.: whenever you receive a competing block for a previously mined height). This allows, for instance, a miner with 35 % hashrate to get higher rewards than probabilistically expected.

Figure 2: Total and individual miner revenue with different hash power and either honest or selfish nodes.

Results found through our ETH POW simulations with an implementation of selfish mining described in [3].

In the image above we see a fork with the secret chain, where a selfish miner with enough hashing power has obtained a lead over the rest of the miners, but only reveals up to the current height honest miners see (4), giving it a lead of 2 more blocks (5 & 6) which remain hidden until another block is published by an honest miner. By continuing to do this, the miner can take the lead, forcing others to continuously try to catch up. This is possible due to the probabilistic  nature of the mining mechanism where a block’s nonce (cryptographic puzzle) is found with the same probability as the miner’s hashing power i.e.: if you have 25% of the hashing power it means you will find the solution with 25% probability. This property is what allows miners with a certain hashing power to take a lead over others. 

Reinforcement Learning

Reinforcement learning is the process through which an agent learns from their interactions with an environment through predefined actions. Reinforcement learning is often modelled as a Markov Decision Process which is particularly beneficial in the context of signal detection of fairness, since the Markov property allows us to be concerned with only the previous state and not the entire history of states that built the current state. We selected reinforcement learning as a possible tool that can discover similar attacks due to the probabilistic nature of the properties described above. Our intuition was that if an agent can find new paths in a video game and score higher than humans can[4], then it could possibly find different strategies in a consensus protocol. 

In the case of ETH POW mining the state is represented by the number of blocks you keep secret and the distance between your current highest block and the honest miners highest block. For a concrete example we can look back to Figure 3) and see that based on the state description we would have the tuple [4,2].  As the miner has 4 blocks in a secret fork and has only shared 2 of those. 

In order to test our theory and have an agent running in Ethereum we decided to use a simulated version of the ETH POW protocol that implements the same logic used in the main chain. We use the difficulty calculation as specified in EIP 100, where uncles(a block with the same parent) are taken into account and use the same probabilistic mechanism for miners.

We prefer the use of a simulated Ethereum protocol as many challenges arise when trying this on the real Ethereum:

  • Time: a block is mined every 15 seconds, whereas a simulation can use its own time and thus accelerate the production of data and results
  • Replicability: in order to validate results one must be able to reproduce them, which can be achieved in a controlled environment like our simulation
  • Parallelisation: we can run multiple instances at the  same time with different goals for the agent and with different learning rates which would not be possible in the real Ethereum.


To test whether we could discover dishonest mining strategies using an AI agent we decided to leverage existing frameworks used in the area of reinforcement learning algorithms. We built an agent in Python using the GYM framework, a tool that supports comparing different reinforcement learning algorithms as well as teaching agents. We built the simulated ETH POW in Java, using our own framework called Wittgenstein, which comes with out of the box network (with P2P, Gossip-sub) and a setup for nodes, latencies and messaging. 

We setup the protocol so that every time the agent mines or receives a block, it selects an action through the algorithm and sends it back to the simulation. As figure 4) shows, this is an iterative process, each time an action is required the agent will select one, send it and receive a reward and data from the state of the environment (observations). The agent will record the rewards received and use them later to make decisions about the best strategy. 

We chose a relatively simple algorithm to remove unnecessary complexity. However, there are numerous other algorithms that may be used, and reasonably complex models may be built to discover more interesting results for attacks. The algorithm we selected is based on Q-Learning, where a table is created (called Q) whose rows represent all possible states and whose columns represent all possible actions. The algorithm “Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state”[5]

In other words, the agent finds the optimal path by recording the values at each state action and then selecting the path which returns the maximum reward. 

For the purpose of training the agent, we must establish the set of actions A that the agent can take. In a game this is analogous to choosing left or right, in our case we focus on block propagation, whether you publish a block after you mine one or whether you add it to your secret fork.


  •  The set of possible actions A={1,….K} the agent can take:
    •  0: hold 1 block
    •  1: publish 2 blocks in a row
    •  2: publish 3 blocks in a row
    •  3: add 1 block to private chain 

We must also define the data from the world-state that the agent has access to in order to learn. They reflect the changes after an action is taken, and help the agent learn to move from random actions into strategic ones.


  • The observations received, partially-observed:
    • Distance between your highest block and the highest block in the secret height 
    • Number of blocks we have in a row from the current head

Complementary to the world-state, the agent also receives a reward with each action, which can be positive or negative and is used in the Q-learning algorithm to calculate which set of actions delivers the highest reward. We tested a number of different rewards, but found that the agent learned best by rewarding advance over others.


  • The rewards associated after an action is taken and the world state is modified. 
    • +Num_of_blocks_advance, number of blocks you have in advance  
    • – Num_of_blocks_behind, if you lose the lead you get negative rewards  

Test Scenario

For simplicity we implemented a simulation of ETH POW, where we only run 9 mining nodes, and 1 node that uses reinforcement learning to strategize it’s block publication scheduling. 

For the action selection we run a simple Q-learning algorithm. The basic idea relies on creating a table that records rewards received at all possible state action combinations in a table Q[state,action]. The table is initialized with all values in each row and column to 0.

Initially the table gets updated by taking random, valid actions (i.e.: you can only send a block if you’ve successfully mined one), and recording received rewards. 

As new rewards are found, the algorithm uses an equation to define how much an individual action affects previously recorded ones by following this formula:

Equation 1: Bellman equation

Where,  rt is the reward received when moving from the state st to the state st+1 by taking action at, α is the learning rate and 𝛾 is the discount factor.

The reward is calculated on the immediate reward r received after action at, plus the reward of the next action and state given that you are at state s. The reward for the next step is discounted by the value 𝛾.  The code implemented looks as follows:


Note: full code can be found here

We use 3 parameters with values between [0,1]:

  • 𝜺: determines the choice between a random action and one with the highest q-value
    • Defines 2 states in the algorithm: exploration stage is when the choice is more random, i.e. 𝜺 is closer to 1, and the exploitation stage is when the choice of action favours the highest q-value, i.e. 𝜺 is closer to 0.
  • 𝛾: To account for future rewards
    • The closer you set it to 1 means the more you value future rewards. This can be used to account for inflation or to promote long term strategies, i.e. allow for the agent to learn delayed gratification.
  • α: To define how much the values learned are influenced by newer values calculated
    • Determines to what extent newly acquired information overrides old information, it’s also called the learning rate and it’s usually set to a value closer to 0, to avoid over learning, which is tested over different values.  

After a fixed number of actions have been taken, the agent will switch from taking random actions to using the algorithm which selects the action with maximum reward at its respective state, as logged in the Q table above.

The algorithm we implemented balances between the explore and exploit states to avoid falling in paths that would constitute a local maximum. 


We used the following values for α, 𝛾, 𝜺:

α 𝛾𝜺Hashrate

For each α value we test for all the different 𝛾 values and set 𝜺 to start with  0.9 as initial value and decrease it to the other values until  it reaches 0.1. We ran a total of 16 simulations for each of the hashing rates listed above in order to see if the agent can learn selfish mining or other more optimal strategies, at 10%, 40% and 60% hashing rate. 

As mentioned above,  we start with an α value of .5 as we don’t want the agent to over estimate some rewards, and selected a few values closer to 0 to see at which rate the agent learned best.

We maintain at 𝛾 a higher value as we consider the time between mining blocks to be negligible and thus the future reward received is almost as good as a present reward.

Mining strategies

We expected to see some indication that the agent would try to engage in dishonest mining, and we expected it to successfully learn this strategy when the agent had 40% and 60% of the total hashpower. We recall the strategies mentioned in the introduction:

  • Honest mining: the miner follows the protocol as described: as  soon as it solves the nonce calculation it publishes that block to the entire network and starts building a new one
  • Selfish mining: when  a miner has less than majority of the hashing power (<50%) but can strategize the release of mined blocks and thus increase its revenue
  • Majority mining: When a miner has more than 50% it is essentially able to control the entire network, there is no incentive to mine honestly and thus can continue to build in its own fork knowing that they will win. 


We collected all the results from the simulations for all the combinations of α, 𝛾, 𝜺 and hashing power and present the following results:

We tested for 2 different reward functions for the agent, one where the rewards is tied to the number of blocks you have a lead on, as seen in figure 2, this is looking at the distance between your height and the current known head. The second reward function is based on tracking whether or not you published the last known block in the common head. 

As expected the most optimal strategy at 10% is to be honest, as probabilistically you could not have a lead from others when your hashing power is so low. 

At 40%, we see that the agent starts to gain an advantage over honest miners and the reward for the dishonest miner is increased compared to an honest miner with the same hashing power. In this scenario the miner starts having a slight advantage, as seen in figure 5) where normally the hashing power also represents the percentage of blocks mined; for a selfish miner at 40% we see a rate of almost 50% of the published blocks are theirs. For the miner using a reward function based on the lead, you also see a better rate, however it is not performing at the levels of the selfish mining simulation we implemented. We attribute this to the need for further refinement of the parameters (α, 𝛾, 𝜺), which can be time consuming. Furthermore, we saw no new attack vectors discovered by the agent, we could only successfully train it to find selfish mining, and thus we leave this exploration as future work.

Finally at 60%, the miner has a lead and it’s in its best interest to mine on his own chain.  At this hashing rate, if the selfish mining attack is successful you also find that the agent with reward function based on last published block has dramatically increased the percentage of published blocks, which is close to 95%. As mentioned before, when you have more than 50% of the hashing  power there is no incentive to be honest as you can dominate the network and thus produces almost 90% of the blocks, as others have difficulty catching up to you.

We note that although  results of agents are better than honest mining, there is still more work needed to improve the agents strategy as the agent does not outperform or achieve the same results as the selfish miner.

Key Takeaways 

Despite selecting one of the simplest reinforcement learning algorithms, we successfully trained an agent that is able to find a dishonest strategy and to outperform honest mining, by incentivising with rewards focused on keeping a lead.   

We could generate more accurate models with more data by adding more observations received by the agent such as latency, or by modifying timestamps in blocks as well as to  test with other reinforcement learning algorithms and more complex models where the reward function incentivises other strategies, however we leave this for future work. 

Lastly, we want to encourage participation from others, to test these techniques. We contribute our results and framework as a starting  point for anyone trying to use reinforcement learning to detect protocol design flaws. 


[1] Satoshi Nakamoto,  “Bitcoin: A Peer-to-Peer Electronic Cash System”,2008. Access: https://bitcoin.org/bitcoin.pdf

[2] I. Eyal and E. Gun Sirer, “Majority is not Enough: Bitcoin Mining is Vulnerable”,  2014.

  Access: https://www.cs.cornell.edu/~ie53/publications/btcProcFC.pdf

[3]  C. Grunspan and R. Perez-Macro, “Selfish Mining in Ethereum”, 2017. Access: https://arxiv.org/pdf/1904.13330.pdf

[4] v. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, “Playing Atari with Deep Reinforcement Learning”,2013. Access: https://arxiv.org/pdf/1312.5602v1.pdf

[5] F. Melo, “Convergence of Q-learning: a simple proof”, 201 .Access: http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf