Wumpus, 2003

The Wumpus example is widely used in the artificial intelligence research as a testbed for new ideas and comparison with the previous ones.

The need for some nice way to show the Reinforcement Learning techniques in action raised for the Reinforcement learning presentation. Wumpus proved to be exactly what was needed.

Description

The Wumpus example contains several objects interacting one with another in the virtual environment. The environment is simplified (board-like environment divided into fields, like the chess board) in order to allow easier simulation. In different flavors of the task different interactions are possible.

There is a Treasure Hunter searching in the cave for the gold. However, such a cave can be full of dangerous obstacles. There might be holes, but even worse, there might be Wumpus, little stinky cave creature, waiting for the its victim!

In the classical case, the cave is really dark. This means that the Treasure Hunter cannot see anything. Luckily for him, there is a freezing fresh air going out of the holes and Wumpus is really stinky. This helps the Treasure Hunter to make a decision in every step, whether to go further or temporarily retreat and try the other way.

The classical Treasure Hunter has also three arrows that he can use to shoot the Wumpus on the next field of the board.

Simplified implementation

For the presentation purposes the environment was simplified even more. There are no more holes, but there are more Wumpuses. The Treasure Hunter does not have arrows any more. He has to find the peacefull way. It is guaranteed, that in the generated random cave there exists a peacefull way to the gold.

Our Hunter is made reactive (mostly). This means it has almost no memory and does not build the map of the cave in the mind. Every decision in every step is based only on following facts:

And the Treasure Hunter has following options:

The trial is over, either when the Hunter steps on the Wumpus, or when he finds the golden treasure.

Reinforcement learning

The presented method, Reinforcement learning, is based on the delayed rewards and punishments. This means, that while the learning entity, agent, acts, time to time it receives reward, when something desired occurred, or punishment, when something undesired happened. This received reward, however, does not belong only to the last action. Nice and simple example is chess: The reward, win, is awarded only after the long sequence of moves, while it may not be clear, which move was the winning one. Similarly, the punishment, lost game, comes also after a long sequence of moves. When the game was lost, it does not necessarily mean, that the last moves were wrong. Maybe there was the mistake in the beginning of the game, but all the following heroic effort was useless.

The goal of the Reinforcement learning is to infer, under such circumstances, which behavior, policy, leads to the success and which to the failure.

Comparison with other method

To give the better feeling, how successfull, or unsuccessful, Reinforcement learning in our scenario is, the best known deterministic (predictable) algorithm was implemented: uninformed depth-first search.

Uninformed depth-first search

Uninformed, because the algorithm has no preferences, heuristic, where to go. No "I would prefer to go left" rule.

Depth first search algorithms go forward, as long as it is possible, than retreat, when it is not possible (smelly danger). After reaching the last point, where another decision option was possible, this option is tried in the same fashion.

Results

If we would not consider the fact, that the Reinforcement learning presentation went fine, but if we would be also interested in the comparison results, we would be highly surprised.

Even if it can be proved, that the uninformed depth-first algorithm is the optimal under conditions, where no assumption about the treasure position can be done, Reinforcement learning performs better. After about thirty learning trials the learning Treasure Hunter is on average two steps faster than the preprogrammed one.

But how can this happen? Proof is a proof, after all, isn't it?

Well, the proof says, that a fixed strategy like "search along the left wall", or "search along the right wall", or "search in the center" cannot on average perform better, than the depth-first search. However, the learning Treasure Hunter learns to switch(!) these strategies. When there is something stinky found by the left wall, the "search on the left" is dropped in favor of another strategy. This highly sophisticated behavior makes the proof for non-switching policy irrelevant.