Fundamentals of Artificial Intelligence

0 Introduction

0.0 Book Resources

Artificial Intelligence A Modern Approach, 3rd Edition

0.1 AI

Artificial intelligence (AI) is the branch of computer science that develops machines and software with intelligence.

0.2 The Turing Test

The Turing Test was designed to provide a satisfactory operational definition of intelligence.
2 people & 1 machine
- an interogator: to find out which candidate is the machine or human, only by asking them questions.
- If the machine can fool the interrogator 30% of the time, the machine is considered intelligent.

1 Problem Formulation

1.1 Problem Formulation

Problem formulation is the process of deciding what actions and states to consider, given a goal.

1.2 Problem Components

Initial State: the starting state of the problem, defined in a suitable manner.
Actions (Operators): an action or a set of actions that moves the problem from one state to another.
- Neighbourhood: The set of all possible states reachable from a given state by applying all legal action(s).
- Successor Function: the action(s).
Goal Test: a test applied to a state which returns true if we have reached a state that solves the problem.
Path Cost: how much it costs to take a particular sequence of actions.

A solution to a problem is an action sequence that leads from the initial state to the a goal state.

State Space: the set of all states reachable from the initial state (it forms a directed network or graph), determined by the initial state and the successor function, decides the the complexity of a problem based on its size.

1.3 Problem Representation - Graph

Graph: to model the deeper structure of a problem
- Consists:
  - A set V of nodes
    - has a unique label for identification
    - represents possible stages of a problem
  - A set E of edges
    - is the connection between two nodes
    - represents inferences, moves in a game, or other steps in a problem solving process (operator)

1.4 Search Trees

Tree: a graph that
- is connected but becomes disconnected on removing any edge
- is connected and acyclic
- has precisely one path between any two nodes

Nodes
- Root node
- Children / parent of nodes
- Leaves
Branches
Average branching factor: average number of branches of the nodes in the tree
Depth
- of a node: the number of edges (branches) away from the root node
- of a tree: the depth of the deepest node

Problem Representation

State: nodes

Initial State: root node

State Space: all nodes in the tree

Neighborhood: the set of all possible states reachable from a given state

Number of neighborhoods: branching factor

Operator(s): branches

How big can a search tree be?

A branching factor b

One goal exists at depth d

b^d different branches in the tree

1.5 Properties of Search

Complete Search Methods' properties
- If a goal exists then the search will always find it
- If no goal exists then the search will eventually finish and be able to say for sure that no goal exists

1.6 Evaluating a search

Completeness: Is the strategy guaranteed to find a solution?
Time Complexity: How long does it take to find a solution?
Space Complexity: How much memory does it take to perform the search?
Optimality: Does the strategy find the optimal solution where there are several solutions?

1.7 Search Tree Issues

Wasteful on space
Difficult to implement
Not always obvious which node to expand next. We may have to search the whole tree looking for the best leaf node. This can obviously be computationally expensive.

2.1 Categories

Uninformed / Blind Search
- BFS
- DFS
- UCS
Informed / Heuristic Search
- Greedy Search
- A* Search

2.2.1 Fundamental actions

Expand: ask a node for its children
Test: test a node to see whether it is a goal

2.2.2 General Tree Search

Function General-Search(p, QUEUING-FN) returns a solution or failure.

nodes = Make-Queue(Make-Node(Initial-State[p])) 
Loop do 
	If (nodes is empty) then { return failure }
	node = Remove-Front(nodes) 
	If (Goal-Test[p] on State(node) succeeds) then { return node }
	nodes = QUEUING-FN(nodes, Expand(node, Operators[p])) 
End

2.2.3 General Graph Search

Function General-Search(p, QUEUING-FN) returns a solution or failure.

nodes = Make-Queue(Make-Node(Initial-State[p])) 
Loop do 
    If (nodes is empty) then { return failure }
    node = Remove-Front(nodes) 
    If (Goal-Test[p] on State(node) succeeds) then { return node }
    	add the node to the explored set 
    nodes = QUEUING-FN(nodes, Expand(node, Operators[p])) 
End

2.2.4 Search Strategies

Strategy: picking the order of node expansion
Evaluating strategies
- Completeness: does it always find a solution if one exists?
- Time complexity: number of nodes generated/expanded
- Space complexity: maximum number of nodes in memory
- Optimality: does it always find a least-cost solution?

Time and space complexity are measured in terms of

b: maximum branching factor of the search tree

d: depth of the least-cost solution

m: maximum depth of the state space (may be ∞)

2.2.5 Breadth First Search

Method
- Expand shallowest unexpanded node
  - Expand Root Node First
  - Expand all nodes at level 1 before expanding level 2
  - …
  - Expand all nodes at level d before expanding nodes at level d+1
- Queuing function: adds nodes to the end of the queue
- General-Search(problem, ENQUEUE-AT-END)
Evaluation
- Optimal: yes (only if the branch costs are the same)
- Complete: yes
- Time complexity: O(b^d) i.e. the number of leaves
- Space complexity O(b^d) i.e. the total number of nodes
b: the average branching factor

d: the depth of the search tree

2.2.5 Depth First Search

Method
- Expand Root Node First
- Explore one branch before exploring another branch
- Queuing function: adds nodes to the front of the queue
- Function DFS(problem) returns a solution or failure
- Return General-Search(problem, ENQUEUE-AT-FRONT)
Evaluation
- Optimal: no
- Complete: no
- Time complexity: Θ(|V| + |E|), b^m
- Space complexity O(|V|), bm

2.2.6 Uniform Cost Search (Cheapest Fisrt Search) (VS. BFS)

Method
- Always remove the smallest cost node first
- Sort the queue in an increasing order, or
- Search the queue and remove the node of smallest cost
- Function UCS(problem) returns a solution or failure
- Return General-Search(problem, ENQUEUE-BY-COST)
Evaluation
- Optimal: yes (even branch costs are different)
- Complete: yes
- Time complexity: O(b^d) i.e. the number of leaves
- Space complexity O(b^d) i.e. the total number of nodes
VS. BFS
- UCS = BFS (when all branches have the same cost (in the worst case)
- UCS is usually better than BFS

https://www.youtube.com/watch?v=dRMvK76xQJI

3 Search Strategies - Heuristics

3.1 Several AI Methods

Heuristics
Fuzzy Logic
Bayes reasoning
Game Theory
Evolutionary Computation
Evolutionary Game Theory

3.2 Definition of Heuristics

A heuristic is any approach to problem solving, learning, or discovery that employs a practical method not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution.

3.3 Types of Solutions

A solution to an optimization problem specifies the values of the decision variables, and therefore also the value of the objective function.
A feasible solution satisfies all constraints.
An optimal solution is feasible and provides the best objective function value.
A near-optimal solution is feasible and provides a superior objective function value, but not necessarily the best.

3.4 Continuous & Combinatorial

Continuous: optimization problems that has an infinite number of feasible solutions
- Continuous problem generally maximize or minimize a function of continuous variables such as min 4x + 5y where x and y are real numbers
Combinatorial: optimization problems that has a finite number of feasible solutions
- Combinatorial problems generally maximize or minimize a function of discrete variables such as min 4x + 5y where x and y are countable items (e.g. integer only)

3.5 Constraints

Hard Constraints: must be satisfied
Soft Constraints: desirable to satisfy
Explicit Constraints: stated in the problem
Implicit Constraints: obvious to the problem

3.6 Search Basics

Search: the term used for constructing/improving solutions to obtain the optimum or near-optimum.
Solution: encoding (representing the solution)
Neighborhood: nearby solutions (in the encoding or solution space)
Move: transforming current solution to another (usually neighboring) solution
Evaluation: the solutions’ feasibility and objective function value
Constructive: search techniques work by constructing a solution step by step, evaluating that solution for (a) feasibility and (b) objective function
Improvement: search techniques work by constructing a solution, moving to a neighboring solution, evaluating that solution for (a) feasibility and (b) objective function
Properties: Search techniques may be deterministic or stochastic

3.7 Heuristic Search

Heuristic Search (Informed Search): Add domain-specific information to select the best path along which to continue searching
Heuristic search works by deciding which is the next best node to expand (there is no guarantee that it is the best node)
A heuristic method is particularly used to rapidly come to asolution that is hoped to be close to the best possible answer, or 'optimal solution’

Blind Search vs. Heuristic Searches

Blind search

Blindly choose where to search in the search tree

When problems get large, not practical any more

Heuristic search

Explore the node: more likely to lead to the goal state

Using knowledge, so called informed search

Heuristics: "rules of thumb", educated guesses, intuitive judgments or simply common sense.

Greedy search, A* search

3.8 Heuristic Function

h(n) estimates the “goodness” of a node n
- specifically, h(n) = estimated cost (or distance) of minimal cost path from n to a goal state
- h(n) ≥ 0 for all nodes n and h(n) = 0 implies that n is a goal node

3.9 Greedy Search

Use as an evaluation function f(n) = h(n), sorting nodes by increasing values of f
Selects node to expand believed to be closest(hence“greedy”) to a goal node (i.e., select node with smallest f value)
Pros: simple, easy to implement, run fast
Cons: often do not provide a globally optimum solution (but a locally optimum solution)
Not complete: A complete heuristic always reaches the goal if thereexists a route
- Time and space complexity is O(b^d)
Not admissible: An admissible heuristic never overestimates the cost of reaching the goal

https://youtu.be/3XaqEng_K5s

Greedy Search vs UCS

image-20180529014831509

3.11 A* Algorithm

https://youtu.be/6TsL96NAZCo

Combines the cost so far and the estimated cost to the goal f(n) = g(n) + h(n)
- g - cost from the initial state to the current state so far
- h - the cost from the current state to a goal state (underestimated)
- f - an evaluation of the state: estimated cost of the cheapest solution through n

Notes:

To be optimal and complete, the heuristic must be admissible

Admissible: must use a heuristic h(n) that never overestimate the cost

Methods:

Euclidean distance (straight line distance)

Manhattan distance (the sum of absolute pitch)

When h = 0, A* becomes UCS => complete & optimal*, but search undirected

When h too large thus dominates g, A* becomes becomes Greedy => lose optimality

Informedness

For two A* heuristics h1 and h2, if h1(n) <= h2(n) for all states n in the search space

we say h2 dominates h1 or heuristic h2 is more informed than h1.

Means:

(In efficieny,) A* using h2 will never expand more nodes than using h1.

Hence:

it is always better to use a heuristic function with higher values

provided it does not over-estimate

and that the computation time for the heuristic is not too large.

4 Fuzzy Sets & Game Theory

4.1 Fuzzy Logic

4.1.1 What

Fuzzy logic is a form of many-valued logic in which the truth values of variables may be any real number between 0 and 1. It is employed to handle the concept of partial truth, where the truth value may range between completely true and completely false.

4.1.2 Why

Ordinary (Boolean) logic assumes things to be true or false. No grey in between.
Ordinary maths is very exact and ignores the uncertainty
Statistics can be used but have assumptions built in
Fuzzy Logic is model free

4.1.3 IF-THEN rules

IF temperature IS very cold THEN fan_speed is stopped
IF temperature IS cold THEN fan_speed is slow
IF temperature IS warm THEN fan_speed is moderate
IF temperature IS hot THEN fan_speed is high

4.2 Utility Theory

4.2.1 Strong AI vs Weak AI

Strong AI: AI that matches or exceeds human intelligence
Weak AI: machines can demonstrate intelligence, but do not necessarily have a mind, mental states or consciousness

4.2.2 Some Basic Concepts

Decision Making: The act of choosing between two or more courses of action, or choosing between possible solutions to a problem.
Rationality: the quality or state of being rational, that is, being based on or agreeable to reason.
Utility is an abstract measure for the amount of satisfaction you receive from something (mainly in Economics). For example, one receives utility from eating food, watching movies, and etc.
Two axiomatic assumptions on rationality:
1. In any given situation a decision-maker always chooses the action which is the best according to his/her utility or preferences.
2. Rational choice is common knowledge among all relevant decision makers.
Rational choice: weight the advantages and disadvantages of different options (or actions), and then choose the action that maximize individual utility.

4.2.3 Law of Diminishing Marginal Utility (Gossen's First Law)

Marginal Utility: the change in the utility from an increase (or decrease) in the consumption of that good or service.
Law of Diminishing Marginal Utility: the first unit of consumption of a good or service yields more utility than the second and subsequent units, with a continuing reduction for greater amounts. Therefore, the fall in marginal utility as consumption increases is known as diminishing marginal utility.

MU₁>MU₂>MU₃......>MU_n

where MU_n = U_n - U_n-1 means marginal utility of the nth unit.

4.2.4 Expected Utility Hypothesis(Theory)

Expected Utility Hypothesis: concerning people's preferences with regard to choices that have uncertain outcomes (gambles), states that if specific axioms are satisfied, the subjective value associated with an individual's gamble is the statistical expectation of that individual's valuations of the outcomes of that gamble.

EU = ∑ p_i U_i

where U_i is the utility of ith outcome, and p_i is probability of ith outcome.

4.2.5 Decision Making under Uncertainty

Uncertainty: the lack of certainty, a state of limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.

Probability based methods

Bayesian reasoning

Expected utility theory

Game theory

...

Soft computing (non-probability) methods

Fuzzy logic

Interval probability theory

Evolutionary computation

Machine learning

...

4.2.6 Classic Game Theory

4.2.6.1 Game

Game: a situation where the participants’ payoffs depend not only on their decisions, but also on their rivals’ decisions.
- Types:
  - Cooperative Game / Non-cooperative Game
  - Static Game / Multi-stage Game
    - Static: game is played only once (Prisoners’ dilemma)
    - Multi-stage: game is played in multiple rounds (chess)
  - Complete Information / Incomplete Information
    - Complete info.: players know each others’ payoffs (Prisoners’ dilemma)
    - Incomplete info.: other players’ payoffs are not known (Sealed auctions)
  - Simultaneous Play Game / Sequential Play Game
    - Simultaneous Play Game: player chose their strategies simultaneously. (Payoffs can be described by a matrix.)
    - Sequential Play Game: one player (leader) plays before another player (follower). (Payoffs can be described by game tree.)
Game Theory: the study of mathematical models of conflict and cooperation (Strategic Interactions) between intelligent rational decision-makers.
Three elements to describe a game:
- Players: I
- Strategies: S (options each player can choose)
- Payoffs: P (how much utility each player receives for any specific outcome)
Mathematically, a game can be described by a 3-tuple G = {I, S, P}

4.2.6.2 Strategies

Strategy: Decision-maker’s choice(s) in any given situation
Strategy Space: set of all possible actions
- Finite / Infinite
- Pure / Mixed
  - Pure strategy: deterministic actions
  - Mixed strategy: randomized actions (with probability)
  E.g. in Rock-Paper-Scissors game
  - three pure strategies
  - numerous mixed strategies
    - a possible one: a player plays rock, paper and scissors each with 1/3 probability

4.2.6.3 Payoffs

Payoffs: Decision-maker’s utility in any given situation
Payoff Space: set of all possible outcomes
- Described by a Tree or Matrix
E.g. in Rock-Paper-Scissors game

4.2.6.4 Game Theory

Game theory: the study of how people behave in strategic situations.
Two classical game theory branches:
- Cooperative Game Theory: In a cooperative game, the players try to form coalitions. Within a coalition, the players make choices collectively to maximize the payoff of the coalition.
- Non-cooperative Game Theory: In a non-cooperative game, the players make their choices independently in order to maximize their own payoffs.

4.2.6.5 Dominant and Dominated Strategy

Dominant Strategy: a strategy that gives higher payoffs no matter what the opponent does.
Dominated Strategy: a strategy is dominated if there exists another strategy that is dominant. (i.e. a strategy that gives lower payoffs no matter what the opponent does.)

E.g. in Prisoners' Dilemma:

Whatever Prisoner A chooses, best choice for Prisoner B is always ‘Confess’ because -1 < 0 and -10 < -5. Confess’ is dominant strategy for B. So, in the prisoner’s dilemma. ‘Confess’ is dominant strategy for both players and ‘Keep silence’ is dominated strategy for both players.

But, in most games there are no dominant strategies for all players. We cannot use this method to predict the outcome of the game. Nash equilibrium analysis is a general tool for all games.

4.2.6.6 Nash Equilibrium

Nash Equilibrium: a solution concept of a non-cooperative game involving two or more players in which each player is assumed to know the equilibrium strategies of the other players, and no player has anything to gain by changing only his own strategy. (i.e. each player is choosing the best strategy, given the strategies chosen by the other players.)

If each player has a dominant strategy, the state that each player adopts their dominant strategies is Nash equilibrium. (e.g. ‘Confess, Confess’ in Prisoners' Dilemma)
Nash’s Theorem: A non-cooperative game involving two or more players has at least one Nash equilibrium, possibly involving mixed strategy. (J. F. Nash, 1950)
Computing Nash equilibrium:
- Pure strategy Nash equilibrium (easy to compute for finite strategy space)
- Mixed strategy Nash equilibrium (hard to compute when the strategy space is not small)
Example of Computing Mixed Strategies Nash Equilibrium:

If B plays Left, A's expected payoff is 2p_U + 5(1- p_U)

If B plays Right, A's expected payoff is 4p_U + 2(1- p_U)

If 2p_U + 5(1- p_U) > 4p_U + 2(1- p_U), then B would play only Left. But there are no Nash equilibria in which B plays only Left.

If 2p_U + 5(1- p_U) < 4p_U + 2(1- p_U), then B would play only Right. But there are no Nash equilibria in which B plays only Right.

So for there to exist a Nash equilibrium, B must be indifferent between playing Left or Right; i.e. 2p_U + 5(1- p_U) = 4p_U + 2(1- p_U) or p_U=3/5

Similarly, for there to exist a Nash equilibrium, A must be indifferent between playing Up and Down; i.e. p_L = 3-3p_L or p_L=3/4

So the game’s only Nash equilibrium has A playing the mixed strategy (3/5, 2/5) and has B playing the mixed strategy (3/4, 1/4)

A's expected Nash equilibrium payoff is: 1 * 3/5 * 3/4 + 0 * 3/5 * 1/4 + 0 * 2/5 * 3/4 + 3 * 2/5 * 1/4 = 3/4

B's expected Nash equilibrium payoff is: 2 * 3/5 * 3/4 + 4 * 3/5 * 1/4 + 5 * 2/5 * 3/4 + 2 * 2/5 * 1/4 = 16/5
How many Nash equilibria?
- a finite number of players, a finite number of pure strategies => at least one Nash equilibrium
- no pure strategy Nash equilibrium => at least one mixed strategy Nash equilibrium

4.2.6.7 Minimax

Minimax: a decision rule for minimizing the possible loss for a worst case scenario.
- (to solve zero-sum games that can be described by a game tree.)
Minimax Value: the smallest value that the other players can force the player to receive. If a player adopts the minimax strategy, the minimax value can be guaranteed no matter what strategies other players adopt.

https://youtu.be/zp3VMe0Jpf8

4.2.7 Evolutionary Game Theory (EGT)

4.2.7.1 Problems with Classical Game Theory

Assumption of rationality: not always appropriate
Limitation of Nash equilibrium: maybe too many Nash equilibria
Difficulty in computing Nash equilibrium: NP hard

4.2.7.2 Some Basic Concepts

Darwinian evolution is based on three fundamental principles: reproduction, mutation and selection.
Evolutionary Game Theory (EGT) is the application of game theory to evolving populations.
- An evolutionary version of game theory does not require players to act rationally – only that they have a strategy.
- It focus on the dynamics of strategy change in the population.
Genotype: the genetic type of a player
Phenotype: the behavior of a genotype
Fitness: a measure of the success of a phenotype
Selection: successful genotypes out-reproduce unsuccessful ones
Mutations: random creation of new genotypes
Invasion: mutations that successfully out-compete the current genotypes and increase in number
Evolutionarily Stable Strategy (ESS): is a strategy which, if adopted by a population in a given environment, cannot be invaded by any alternative strategy that is initially rare.

4.2.7.3 Models of EGT

EGT encompasses Darwinian evolution, including competition (the game), natural selection (replicator dynamics), and heredity.

Components:

Population
Game
Replicator Dynamics

Phases:

The model (as evolution itself) deals with a Population (Pn). The population will exhibit Variation among Competing individuals. In the model this competition is represented by the Game.
The Game tests the strategies of the individuals under the “rules of the game”. These rules produce different payoffs – in units of Fitness (the production rate of offspring). The contesting individuals meet in pairwise contests with others, normally in a highly mixed distribution of the population. The mix of strategies in the population affects the payoff results by altering the odds that any individual may meet up in contests with various strategies. The individuals leave the game pairwise contest with a resulting fitness determined by the contest outcome, represented in a Payoff Matrix.
Based on this resulting fitness each member of the population then undergoes replication or culling determined by the exact mathematics of the Replicator Dynamics Process. This overall process then produces a New Generation P(n+1). Each surviving individual now has a new fitness level determined by the game result.
The new generation then takes the place of the previous one and the cycle repeats. The population mix may converge to an Evolutionary Stable State that cannot be invaded by any mutant strategy.

4.2.7.4 Replicator Dynamics

Replicator Equation: a deterministic monotone non-linear and non-innovative game dynamic used in evolutionary game theory.

where:

x_i is the proportion of type i in the population

x = (x₁, ..., x_n) is the vector of the distribution of types in the population

f_i(x) is the fitness of type i (which is dependent on the population)

φ(x) is the average population fitness (given by the weighted average of the fitness of the n types in the population)

Evolutionarily Stable State: population's genetic composition is restored by selection after a disturbance, provided the disturbance is not too large. Such a population can be genetically monomorphic or polymorphic.
Evolutionarily Stable Strategy (ESS): a strategy which, if adopted by a population in a given environment, cannot be invaded by any alternative strategy that is initially rare.
- ESS = Nash Equilibrium + Stability Condition
  - E(S, T): the payoff for palying strategy S against stategy T
  - If S is an ESS, then for all T ≠ S
    - E(S, S) ≥ E(T, S) and E(S, T) > E(T, T)
  - Some facts:
    - ESSes and Nash equilibria often coincide.
    - Every ESS corresponds to a Nash equilibrium, but some Nash equilibria are not ESSes.
    - Some games have ESS, some don’t have.

Example: The Prisoners Dilemma

Payoff:

T>R>P>S, 2R>T+S

=> Defect is an ESS (and the only Nash equilibrium in this game)

2 genotypes in the population

Cooperators (C-types): always cooperate

Defectors (D-types): always defect

(Suppose that the proportions of cooperators and defectors in the population are initially x and 1-x respectively)

Expected Utility:

A cooperator meets another cooperator with probability x and a defector with probability 1-x and expects to earn: F(c) = x(3) + (1-x)(0)

A defector will also meet a cooperator with probability x and a defector with probability 1-x and expects to earn: F(d) = x(5) + (1-x)(1)

Selection:

The cooperators will outbreed the defectors if:

F(C) > F(D)

=> x(3) + (1-x)(0) > x(5) + (1-x)(1)

=> x < - 1

=> cannot hold

=> So the cooperators will die out

=> 100% defectors is an Evolutionary Stable State.

Example: The Hawk-Dove Game (The Game of Chicken)

The Game of Chicken: while it is to both players’ benefit if one player yields, the other player's optimal choice depends on what his opponent is doing: if his opponent yields, the player should not, but if the opponent fails to yield, the player should.

Payoff:

V: value of the contested resource

C: the cost of an escalated fight

=> No strategy is an ESS.

(There are two Nash equilibria in this game: Hawk-Dove and Dove-Hawk.)

It is (almost always) assumed that the value of the resource is less than the cost of a fight, i.e., C > V > 0. If C ≤ V, the resulting game is not a game of Chicken but is instead a Prisoner's Dilemma.

2 genotypes in the population

Hawk: always fight

Dove: share or concede

(Suppose that the proportions of Hawks and Doves in the population are initially x and 1-x respectively.)

Expected Utility:

A Hawk meets another Hawk with probability x and a Dove with probability 1-x and expects to earn: F(H) = x(V-C)/2 + (1-x)(V)

A Dove will also meet a Hawk with probability x and a Dove with probability 1-x and expects to earn: F(D) = x(0) + (1-x)(V/2)

Selection:

The Hawks will outperform the Doves if:

F(H) > F(D)

=> x(V-C)/2 + (1-x)(V) > (1-x)(V/2)

=> x < V/C

The Doves will outperform the Hawks if:

x > V/C

The Doves and the Hawks will be in balance if:

x = V/C

Analysis:

C>V>0. Otherwise the Hawks will dominate.

x = V/C is an equilibrium state, in which the fitness of the Hawks and the Doves is the same and the percentages of two genotypes remain stable.

When C >> V, the percentage of the Hawks will be at low level in the population because fighting is expensive.

When C -> V, the percentage of the Hawks will be at high level in the population because fighting is cheap.

Example: The Iterated Prisoner's Dilemma (IPD)

The Iterated Prisoner's Dilemma (IPD): two players play prisoner's dilemma more than once in succession and they remember previous actions of their opponent and change their strategy accordingly

No ESS =>

Oscillation

Converge to a stable state (depends on initial population and mutation rate)

Divergence

Strategies:

Always Defect (AllD)

Always Cooperate (AllC)

Tit-for-tat (TFT): If you cooperated, I cooperate; If you defected, I defect.

Tit-for-two-tat (TFTT)

Gram trigger (Revenge)

Win-stay, lose-shift

Random

...

5 Bayes Fundamental

5.1 Some Basic Concepts

Bayesian Probability: a formalism that allows us to reason about beliefs under conditions of uncertainty.

P(A): the probability P of an uncertain event A

P(a|K): represents a belief measure

(e.g. a person's subjective belief in a statement a will depend on some body of knowledge K)

when K remains constant we just write P(a) in simplification.

Bayes Rule:

Prior Probability (Prior) of an uncertain quantity: the probability distribution that would express one's beliefs about this quantity before some evidence(data) is taken into account.
Posterior Probability: is the probability of assigning observations to groups given the data. (i.e. the conditional probability that is assigned after the relevant evidence or background is taken into account.)

In general, we want to relate an event (E) to a hypothesis (H) and the probability of E given H.

5.2 Bayes’ theorem

6 Machine Learning & Data Mining

6.1 Machine Learning

6.1.1 What

Machine Learning: relates with the study, design and development of the algorithms that give computers the capability to learn without being explicitly programmed.

An agent is learning if it improves its performance on future tasks after making observations about the world.

6.1.2 Why

The designers cannot anticipate all possible situations that the agent might find itself in.
The designers cannot anticipate all changes over time.
Sometimes human programmers have no idea how to program a solution themselves.

6.2 Supervised Learning

6.2.1 What

Supervised Learning: the agent observes some example input- output pairs and learns a function that maps from input to output

6.2.2 How

given input samples (x) and labeled outputs (y) of a function y = f(x), “learn” f, and evaluate it on new data
- Classification: y is discrete (class labels). Learn a decision boundary that separates one class from another
- Regression: y is continuous, e.g. linear regression. Learn a continuous input-output mapping, also known as “curve fitting” and “function approximation”

F(x): function

D: labeled training sample (x, F(x))

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 => 0 78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 => 1 69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 => 0 18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 => 0 54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 => 1

G(x): model learned from D

71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 => ?

Goal: minimise E[(T - O)2] for future use

E[ (F(x) - G(x))² ] ≈ 0

6.3 Unsupervised Learning

6.3.1 How

given only samples x of the data, infers a function f such that y = f(x) describes the hidden structure of the unlabeled data - more of an exploratory/descriptive data analysis
- Clustering: y is discrete. Learn any intrinsic structure that is present in the data
- Dimensional Reduction: y is continuous. Discover a lower- dimensional surface on which the data lives

No function

D: Unlabeled training dataset:

57,M,195,0,125,95,39,25,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0 78,M,160,1,130,100,37,40,1,0,0,0,1,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 69,F,180,0,115,85,40,22,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 18,M,165,0,110,80,41,30,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 54,F,135,0,115,95,39,35,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0 84,F,210,1,135,105,39,24,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 89,F,135,0,120,95,36,28,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0 49,M,195,0,115,85,39,32,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 40,M,205,0,115,90,37,18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 74,M,250,1,130,100,38,26,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 77,F,140,0,125,100,40,30,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1

71,M,160,1,130,105,38,20,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 => ?

Learn: ?

Goal: ?

6.4 Machine Learning Applications

Computer Vision
- Image tagging
- Self-driving car
- Image pattern recognition
Text Analysis
- Spam filtering
- Information extraction
Video Games & Robotics
- Checker/chess/go
Data Mining

6.5 Data Mining

6.5.1 What

Data Mining: the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
- Valid: hold on new data with some certainty
- Novel: non-obvious to the system
- Useful: should be possible to act on the item
- Understandable: humans should be able to interpret the pattern

Machine Learning vs Data Mining

Machine learning: predicts with models

Data mining: explains patterns

Process of KDD

The knowledge discovery in databases (KDD) process is commonly defined with the stages:

Selection

Pre-processing

Transformation

Data mining [The core of KDD]

Interpretation/evaluation

6.5.2 Why

Commercial viewpoint:
- Banking: loan/credit card approval
  - predict good customers based on old customers
- Fraud detection: network security, financial transactions
  - use historical data to build models of fraudulent behavior
  - use data mining to help identify similar instances
- Customer relationship management (CRM)
  - Which of my customers are likely the most loyal
  - Which are most likely to leave for a competitor
  - Identify likely responders to sales promotions
Scienctific viewpoint:
- Medicine: disease outcome, effectiveness of treatments
  - analyze patient disease history & find relationship between diseases
- Astronomy: scientific data analysis
  - identify new galaxies by the stages of formation
- Web site design and promotion
  - find affinity of visitor to pages and modify layout

6.5.3 Tasks

Data mining involves six common classes of tasks:

Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records, that might be interesting or data errors that require further investigation.
Association rule learning (dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering – is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error that is, for estimating the relationships among data or datasets.
Summarization – providing a more compact representation of the data set, including visualization and report generation.

In this module:

Classification: Identify to which set a new observation belongs
Clustering: Group a set of objects so objects in the same cluster are more similar
Association rule discovery: Discover interesting relations between variables in large databases ØIdentify rules using some measures

6.5.3.1 Task 1: Classification (Supervised Learing)

Classification: learn a method to predict the instance class from pre-labeled (classified) instances
Many approaches:
- Regression
  - y = ax₁ + bx₂ + c
- Decision Trees
  - divide decision space into piecewise constant regions
- Neural Networks
  - partition by non-linear boundaries
- Support vector machines
- ...
Application:
- For target marketing: reduce cost of mailing by targeting consumers who are likely to buy a
  
  new cell-phone product.

6.5.3.1.1 Approach 1: Regression

To find the best line (linear function y=f(x)) to explain the data.
Cons:
- Not flexible enough

6.5.3.1.2 Approach 2: Decision Tress

Nodes:
- Internal node: decision rule on one or more attributes
- Leaf node: a predicted class label
Pros:
- Reasonable training time
- Can handle large number of attributes
- Easy to implement
- East to interpret
Cons:
- Simple decision boundaries
- Problems with lots of missing data
- Cannot handle complicated relationship between

6.5.3.1.2 Approach 3: Neural Networks

Pros:
- Can learn more complicated class boundaries
- Can be more accurate
- Can handle large number of features
Cons:
- Hard to implement: trial and error for choosing parameters and network structure
- Slow training time
- Can over-fit the data: find patterns in random noise
- Hard to interpret

6.5.3.2 Task 2: Clustering (Unsupervised Learing)

Clustering: partition the data so the instances are grouped in similar items by using distance/similarity measure.
- A set of data points, each with a set of attributes and a similarity measure, find clusters such that
  - Data points in one cluster are more similar
  - Data points in separate clusters are less similar to one another
Methods of similarity measure:
- Euclidean or Manhattan distance
- Hamming distance
- Other problem specific measures
Approches:
- Partitioning-based clustering
  - K-means clustering
  - K-medoids clustering
- Density-based clustering
  - Separate regions of dense points by sparser regions of relatively low density
Application:
- For market segmentation: divide a market into distinct subsets of customers, any subset may be a market target

6.5.3.2.1 Partitioning-Based Clustering: K-Means

Goal: minimise sum of square of distance
- Between each point and centers of the cluster
- Between each pair of points in the cluster
Algorithm:
- Initialize K cluster centers
  - random, first K, K separated points
- Repeat until stabilization:
  - Assign each point to closest cluster center
  - Generate new cluster centers
  - Adjust clusters by merging or splitting

6.5.3.2.2 Density-Based Clustering

Cluster: a connected dense component
Density: the number of neighbors of a point

6.5.3.3 Task 3: Association Rule Discovery

Associate Rule Discovery: discover interesting relations between variables in large databases.
Associate Rule: discover dependency rules, which predict occurrence of an item based on occurrences of other items.
Application:
- market basket analysis:
  - Association rules: 60% of customers who purchase X and Y also buy Z
  - Sequential patterns: 40% of customers who first buy X also purchase Y within three weeks

7 (Artificial) Neural Networks

7.1 Neural Networks

7.1.1 What

Artificial neural networks (ANNs) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains.
Consists of:
- A set of inputs - (dendrites)
- A set of resistances/weights – (synapses)
- A processing element - (neuron)
- A single output - (axon)

7.1.2 The First Neural Networks

y_in = x₁ * 2 + x₂ * 2 + x₃ * (-1)
For the network shown here the activation function for unit Y is
- f(y_in) = 1 (fire), if y_in >= θ; else 0 (not fire)
- where y_in is the total input signal received (weighted sum);
- θ is the threshold for Y

7.1.3 Neural Networks Architectures

Three different classes of network architectures
- single-layer feed-forward
- multi-layer feed-forward
- recurrent

7.1.4 Linear Separability

Linearly Separability: classes of patterns with -dimensional vector can be separated with a single decision surface
- Only linearly Separable functions can be represented by a single layer NN (perceptron)
  
  Perceptron: synonym for Single-Layer, Feed-Forward Network

7.1.5 Training a NN

For an AND Gate:

While epoch produces an error
	check next inputs (pattern) from epoch
	Err = T - O
	If Err <> 0 then
		w_i = w_i + LR * x_i * Err
	End If
End While

Epoch: the entire training set feed into the neural network
- e.g. The AND function's epoch: ([0,0], [0,1], [1,0], [1,1])
Training Value (T): the value we require the network to produce
- e.g. the T of [1, 0] is 0
Error (Err): the amout the output by the network O differs from the training value T
x_i: inputs to the neuron
w_i: weight from input x_i to the output
Learning Rate (LR): how quickly the network converges
- it's set by experimentation, typically 0.1