Artificial Collective Intelligence
Dr. Jun Wang, UCL
Deep Reinforcement Learning
• Computerised agent: learning what to do
– How to map situations (states) to actions so as to maximise a numerical reward signal
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
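To make the state-action-reward loop concrete, here is a minimal tabular Q-learning sketch; the 5-state chain environment and all hyperparameters are toy assumptions for illustration, not taken from the talk.

```python
import random

# Minimal tabular Q-learning: map states to actions so as to maximise reward.
# The 5-state chain world below is a hypothetical toy, not from the slides.
N_STATES, ACTIONS = 5, [0, 1]            # actions: 0 = left, 1 = right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1        # learning rate, discount, exploration

def step(s, a):
    """Move along the chain; reaching the right end pays reward 1 and ends."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy choice of action for the current state
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # move Q(s, a) toward the reward plus discounted future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})
```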
Human-level Control
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#videos
>75% of human level
AlphaGo vs. the world's Go champion
Coulom, Rémi. "Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength." Computers and Games. Springer Berlin Heidelberg, 2008. 113-124.
http://www.goratings.org/
Last year's rating list
https://deepmind.com/research/alphago/alphago-china/
What is next?
• All of the above are single AI units
• But true human intelligence embraces social and collective wisdom
– Collective efforts can solve problems otherwise unthinkable, e.g., the ESP game and crowdsourcing
• A next grand challenge of AI
– How can large-scale multiple AI agents learn human-level collaborations (or competitions) from their experiences?
Artificial Collective Intelligence
• Huge application space
– Trading robots gaming the stock markets
– Ad bidding agents competing with each other on online advertising exchanges
– E-commerce collaborative-filtering recommenders predicting user interests through the wisdom of the crowd
– Traffic control
– Self-driving cars
– Creativity learning (generating texts, images, music, poetry)
– …
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Controllable Environments in Deep Reinforcement Learning
• In a typical RL setting, the environment is unknown yet fixed.
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Controllable Environments
• We consider an environment that is controllable and strategic
• A minimax game between the agent and the environment
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Framework figure: (1) the environment generator generates environments parameterised by θ; (2) each environment trains an agent until an optimal policy is obtained; (3) the agents operate in their corresponding environments; (4) the observed agent returns guide the generator update.]
Solution for Non-differentiable Transitions
The MDP acts as an adversarial environment minimising the agent's expected return Σ_{t=1..∞} γ^t r_t, so the objective function is formulated as

θ* = arg min_θ max_π E[ G | π; M_θ = ⟨S, A, P_θ, R, γ⟩ ]

This adversarial objective can be applied to design environments that expose the weaknesses of an agent and its policy-learning algorithms.
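A minimal sketch of this minimax loop under strong simplifying assumptions: a single scalar θ (a goal location) stands in for an environment, the inner maximisation is solved exactly, and the non-differentiable generator step uses a score-function (REINFORCE-style) update. None of these specifics come from the paper.

```python
import numpy as np

# Sketch of theta* = argmin_theta max_pi E[G | pi; M_theta]: a Gaussian
# generator proposes environments (goal locations theta); for each, the
# agent's best return is computed; returns then push the generator toward
# environments where even the optimal agent does poorly.
rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.05           # generator: theta ~ N(mu, sigma^2)

def best_agent_return(theta):
    """Inner max: the agent picks the reachable action in [-1, 1] closest to
    the goal at theta; its optimal return is minus the remaining distance."""
    actions = np.linspace(-1.0, 1.0, 21)
    return float(np.max(-np.abs(actions - theta)))

for it in range(200):
    thetas = mu + sigma * rng.standard_normal(8)                # 1. generate environments
    returns = np.array([best_agent_return(t) for t in thetas])  # 2-4. train & test agents
    adv = returns - returns.mean()                              # baseline-corrected returns
    grad_mu = np.mean(adv * (thetas - mu) / sigma**2)           # d/dmu of E[agent return]
    mu -= lr * grad_mu                                          # descend: minimise the return

print(f"adversarial goal location mu = {mu:.3f}")   # drifts out of the agents' reach
```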
Controllable Environments: An Example
• Maze
• Agent: tries to find an optimal strategy for finding the way out
• Environment: generates a maze that makes the way out difficult to find
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
Design Maze: Results
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Figure: mazes generated against different solvers: DFS, RHS, Optimal, DQN]
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Four Auctions
• "Open cry" auctions
1. English auctions
2. Dutch auctions
• "Sealed bid" auctions
3. 1st-price / "pay-your-bid" auctions
4. 2nd-price / Vickrey auctions
[Illustration: four bidders bidding $2, $3, $8, $5]
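As a toy illustration of the two sealed-bid formats (reusing the example bid values from the slide): the winner is the highest bidder in both, but the charged price differs.

```python
# Toy sealed-bid auction: same winner, different charged price.
def sealed_bid_outcome(bids):
    """bids: dict bidder -> bid. Returns (winner, 1st-price, 2nd-price charge)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else top
    return winner, top, second

bids = {"A": 2, "B": 3, "C": 8, "D": 5}
winner, pay_first, pay_second = sealed_bid_outcome(bids)
print(f"winner: {winner}")                                  # C, who bid $8
print(f"1st-price ('pay-your-bid') charge: ${pay_first}")   # $8
print(f"2nd-price (Vickrey) charge:        ${pay_second}")  # $5
```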
Auction Scheme
[Diagram: bidders hold private values v1…v4 and submit bids b1…b4; the auction determines the winner and the payments $$$]
Machine Bidding
[The same diagram, with the bids b1…b4 now placed by machines on the bidders' behalf]
Online Advertising + Artificial Intelligence
• Design learning algorithms to make the best match between advertisers and Internet users under economic constraints
• Online advertising has been transformed from a low-tech process into a highly optimised, mathematical, computer-centric (Wall Street-like) process
• Key directions: operations research; estimating CTR/AR; auction systems; machine learning algorithms; behavioural targeting; fighting spam (click fraud)
(User targeting dominates the context)
RTB Display Advertising Mechanism
• Buying ads via real-time bidding (RTB): 10B auctions per day
• The flow among user, page, ad exchange, and demand-side platform (acting for the advertiser):
0. Ad request
1. Bid request (user, page, context)
2. Bid response (ad, bid price)
3. Ad auction
4. Win notice (charged price)
5. Ad served (with tracking)
6. User feedback (click, conversion)
• A data management platform supplies user information: demography (e.g., male, 26, student) and segmentation (e.g., London, travelling)
• The whole loop completes in <100 ms
[Zhang et al. Optimal real-time bidding for display advertising. KDD 14]
Can we have a dynamic model?
Bidding in RTB as an RL Problem
• From the perspective of an advertiser with an ad budget, sequential bidding in RTB is a reinforcement learning (RL) problem: given bid request x_t, the agent sets bid price a_t; the environment returns the auction result and user response, then issues the next bid request x_{t+1}
• The goal is to maximise the user responses on the displayed ads
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
MDP Formulation of RTB
• Consider bidding in RTB as an episodic process, with [s] state, [a] action, [p] state transition, and [r] reward:
1. [s] state: auctions left T, budget left B_T, and the current bid request x_T
2. [a] action: the bid a
3. [p] state transition: the auction result; [r] reward: the user response
– The state then moves to T−1 auctions left with budget B_{T−1}, and so on down to 0 auctions left with budget B_0, after which the next episode begins
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
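A minimal sketch of one episode of this MDP. The click-rate predictor, the budget-pacing bid rule, and the uniform market price are hypothetical stand-ins for illustration; they are not the paper's method.

```python
import random

# One episode of the RTB MDP: state = (auctions left, budget left, bid request),
# action = bid price, transition = auction result, reward = user response.
random.seed(0)

def run_episode(T=1000, budget=100.0):
    clicks, b = 0, budget
    for t in range(T, 0, -1):                 # [s] auctions left: T, T-1, ..., 1
        x = random.random()                   # [s] bid request feature (toy)
        pctr = 0.1 * x                        # hypothetical predicted click rate
        a = min(b, 200.0 * pctr * b / t)      # [a] bid, paced by remaining budget
        market = random.uniform(0.0, 2.0)     # hypothetical highest competing bid
        if a >= market:                       # [p] auction result: win
            b -= market                       # second-price: pay the market price
            clicks += random.random() < pctr  # [r] user response (click)
        if b <= 0:
            break
    return clicks, budget - b

clicks, spend = run_episode()
print(f"clicks: {clicks}, spend: {spend:.1f}")
```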
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Generative Models
• Classic machine learning tasks (label prediction): recognition maps high-dimensional data to a low-dimensional output (e.g., MNIST images to the digits 0-9)
• Generation tasks (generating actual data): generation maps a low-dimensional representation to high-dimensional data
[Figure: (a) real MNIST images vs. (b) generated images]
Generative Adversarial Nets (GANs)
• A minimax game between a discriminator and a generator:
– The discriminator (D) tries to correctly distinguish the true data from the fake, model-generated data
– The generator (G) tries to generate high-quality data to fool the discriminator
• G and D can be implemented as neural networks
• Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution
[Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In NIPS 2014.]
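A minimal sketch of this minimax game in PyTorch, on 1-D synthetic data rather than images; the architectures and hyperparameters are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Minimal GAN: D separates true from generated samples, G learns to fool D.
torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 3.0 + 0.5 * torch.randn(64, 1)      # true data ~ N(3, 0.5^2)
    fake = G(torch.randn(64, 8))               # model-generated data
    # Discriminator: label true data 1 and (detached) fake data 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: make the updated D label its samples as true
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(torch.randn(1000, 8))
print(f"generated mean/std: {samples.mean():.2f} / {samples.std():.2f}")
```

At convergence the generated mean and standard deviation should approach the true 3.0 and 0.5, the point where D can no longer tell the two apart.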
Labeled Generative Adversarial Nets
• The discriminator is a multi-class classifier trained with labelled data
[Diagram: the generator G produces a sample; the discriminator D classifies it; G's loss is averaged from the predicted labels]
[Paper Figure 1: the problem of overlaid gradients in LabGAN (Salimans et al., 2016) on multi-mode real data: a generated sample between two real classes receives gradients toward both class centres, which blend in the final gradient for G]
From the paper's formulation (Eq. 8), α_k^lab(x) = D_k(x)/D_r(x) for k ∈ {1, …, K} and 1 for k = K+1, so the overall gradient w.r.t. a generated example x is (1 − D_r(x)); this is consistent with the original GAN (Goodfellow et al., 2014) when no label information is given, and the gradient on "real" is then further distributed to each real class logit.
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
Activation Maximisation Generative Adversarial Nets
• The discriminator is again a multi-class classifier trained with labelled data
[Same diagram as the previous slide, with the generator loss replaced by an activation-maximised loss]
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
GAN with Activation Maximisation
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
[Paper Figures 2-3: generated examples against the true density on synthetic data, measured by oracle NLL: LabGAN reaches NLL 17.86 / 17.11 / 16.71 at 50k / 150k / 200k iterations, while SAM-GAN reaches 17.66 / 15.94 / 15.79]
[Paper Figure 4: CIFAR-10 progress from 5,000 to 300,000 iterations; the Inception score rises to 8.34 and the AM score to 9.29]
[Paper Figure 5: real vs. generated MNIST images]
SeqGAN: Sequence Generation
• The generator is a reinforcement learning policy generating a sequence
– It decides the next word to generate (the action) given the previous ones (the state)
• The discriminator provides the reward (i.e., the probability of being true data) for the whole sequence
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
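A minimal REINFORCE sketch of this idea: a toy GRU generator samples a sequence token by token, and a stand-in discriminator's whole-sequence score is used as the reward. SeqGAN itself additionally uses Monte Carlo rollouts to assign rewards to intermediate steps; that part is omitted here.

```python
import torch
import torch.nn as nn

# Generator = policy: next token (action) given the prefix (state).
# The discriminator reward is given only for the whole finished sequence.
torch.manual_seed(0)
V, H, T = 10, 32, 8                        # vocab size, hidden size, seq length
emb, gru, head = nn.Embedding(V, H), nn.GRUCell(H, H), nn.Linear(H, V)
disc = nn.Sequential(nn.Linear(T, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt = torch.optim.Adam([*emb.parameters(), *gru.parameters(), *head.parameters()], lr=1e-3)

def generate():
    """Sample a sequence, keeping per-step log-probabilities for REINFORCE."""
    h, tok = torch.zeros(1, H), torch.zeros(1, dtype=torch.long)  # start token 0
    logps, toks = [], []
    for _ in range(T):
        h = gru(emb(tok), h)
        dist = torch.distributions.Categorical(logits=head(h))
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        toks.append(tok)
    return torch.stack(toks, dim=1).float(), torch.cat(logps)

seq, logps = generate()
reward = disc(seq).squeeze()              # P(sequence is true data)
loss = -(reward.detach() * logps.sum())   # policy gradient: reinforce good sequences
opt.zero_grad(); loss.backward(); opt.step()
print(f"discriminator reward: {reward.item():.3f}")
```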
Experiments on Synthetic Data
• Evaluation measured with an oracle
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
Experiments on Real-World Data
• Chinese poem generation: human-written and machine-generated poems side by side
南陌春风早,东邻去日斜。
紫陌追随日,青门相见时。
胡风不开花,四气多作雪。
山夜有雪寒,桂里逢客时。
此时人且饮,酒愁一节梦。
四面客归路,桂花开青竹。
Human | Machine
Obama Speech Text Generation
Machine-generated:
• i stood here today i have one and most important thing that not on violence throughout the horizon is OTHERS american fire and OTHERS but we need you are a strong source
• for this business leadership will remember now i can't afford to start with just the way our european support for the right thing to protect those american story from the world and
• i want to acknowledge you were going to be an outstanding job times for student medical education and warm the republicans who like my times if he said is that brought the
Human (Obama):
• When he was told of this extraordinary honor that he was the most trusted man in America
• But we also remember and celebrate the journalism that Walter practiced -- a standard of honesty and integrity and responsibility to which so many of you have committed your careers. It's a standard that's a little bit harder to find today
• I am honored to be here to pay tribute to the life and times of the man who chronicled our time.
Summary
• Learning to compete
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
AI Plays StarCraft
• One of the most difficult games for computers
• At least 10^1685 possible states (for reference, the game of Go has about 10^170 states)!
• How can large-scale multiple AI agents learn human-level collaborations, or competitions, from their experiences?
Bidirectionally-Coordinated Nets (BiCNet)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
• Unsupervised training, without human demonstrations or labelled data
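A minimal sketch of the bidirectional-coordination idea: per-agent observations pass through a bidirectional RNN laid out over a fixed agent ordering, so each agent's action logits depend on all teammates' hidden states. Sizes are arbitrary, and the paper's actor-critic training is omitted.

```python
import torch
import torch.nn as nn

# Per-agent features flow through a bidirectional GRU over the agent order,
# so every agent's action depends on the hidden states of all the others.
torch.manual_seed(0)
N_AGENTS, OBS, H, N_ACTIONS = 5, 12, 32, 4
obs_enc = nn.Linear(OBS, H)
birnn = nn.GRU(H, H, bidirectional=True, batch_first=True)
policy_head = nn.Linear(2 * H, N_ACTIONS)

obs = torch.randn(1, N_AGENTS, OBS)       # one team of 5 agents
x = torch.relu(obs_enc(obs))              # per-agent features: (1, 5, H)
coord, _ = birnn(x)                       # information flows both ways: (1, 5, 2H)
logits = policy_head(coord)               # per-agent action logits: (1, 5, N_ACTIONS)
print(logits.argmax(dim=-1))              # one coordinated action per agent
```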
Coordinated Moves Without Collision
Combat: 3 Marines (ours) vs. 1 Super Zergling (enemy)
• Panels (a) and (b): collisions happen when the agents are close by, during the early stage of training
• Panels (c) and (d): coordinated moves by the well-trained agents
[Paper Figure 2]
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
"Hit and Run" Tactics
Combat: 3 Marines (ours) vs. 1 Zealot (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 3: time step 1 (run when attacked), time step 2 (fight back when safe), time step 3 (run again), time step 4 (fight back again)]
From the paper: unlike CommNet [20], the communication is not fully symmetric; certain social conventions and roles are maintained by fixing the order in which the agents join the RNN, which helps resolve ties between multiple optimal joint actions [35, 36].
Coordinated Cover Attack
Combat: 3 Marines (ours) vs. 1 Zergling (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 4: coordinated cover attack over four time steps]
[Paper Table 1: winning rate against difficulty settings by hit points (HP) and damage, at 100k/200k/300k training steps]
Focus Fire
Combat: 15 Marines (ours) vs. 16 Marines (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 5: "focus fire" over four time steps]
Coordinated Heterogeneous Agents
Combat: 2 Dropships and 2 Tanks vs. 1 Ultralisk
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 6: coordinated heterogeneous agents, with attack, load, and unload actions over two time steps]
From the paper: neither scattering over all enemies nor focusing on one enemy (wasting attacking fire, also called overkill) is desired; the grouping design in the policy network lets BiCNet learn "focus fire without overkill", with agents grouped dynamically by their geometric locations.
AI Playing StarCraft: Demo
In collaboration with the Alibaba Group
Building a Persona: Freud's Model of the Human Mind
• The id is the primitive and instinctual part of the mind that contains sexual and aggressive drives and hidden memories
• The super-ego operates as a moral conscience
• The ego is the realistic part that mediates between the desires of the id and the super-ego
https://en.wikipedia.org
(Figure labels: 意识, consciousness; 潜意识, the subconscious)
Reinforcement Learning with 1 Million Agents
[Figure 2: Million-agent Q-learning in a Predator-Prey World. Each agent feeds its (observation, ID) pair through a shared Q-network, with the ID passed through an embedding, to obtain Q-values and pick an action; rewards come back from the environment, and the transitions (s_t, a_t, r_t, s_{t+1}) are stored in a shared experience buffer that updates the Q-network.]
Yaodong Yang et al. An Empirical Study of Collective Behaviors in Many-agent Reinforcement Learning. Submitted, 2017.
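A minimal sketch of the shared Q-network in Figure 2: every agent queries the same network on its (observation, ID) pair, with the ID mapped through a trainable embedding. Sizes and the surrounding training loop are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# One Q-network shared by all agents; the agent ID enters via an embedding,
# so agents can develop individual behaviour without one network per agent.
torch.manual_seed(0)
N_AGENTS, OBS, EMB, H, N_ACTIONS = 1_000_000, 16, 8, 64, 4
id_emb = nn.Embedding(N_AGENTS, EMB)       # one trainable vector per agent ID
q_net = nn.Sequential(nn.Linear(OBS + EMB, H), nn.ReLU(), nn.Linear(H, N_ACTIONS))

def q_values(obs, agent_ids):
    """Q(obs, ID) for a batch of agents sharing the same network."""
    return q_net(torch.cat([obs, id_emb(agent_ids)], dim=-1))

obs = torch.randn(3, OBS)                  # observations for three agents
ids = torch.tensor([7, 42, 999_999])       # three of the one million IDs
print(q_values(obs, ids).argmax(dim=-1))   # greedy action for each agent
```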
Artificial Population vs. Real Population
Yaodong Yang et al. An Empirical Study of Collective Behaviors in Many-agent Reinforcement Learning. Submitted, 2017.
Thanks for your attention
http://www.thisisbarry.com/single-post/2015/12/28/The-Thirteenth-Floor-1999-Explained