Artificial Collective Intelligence
Dr. Jun Wang, UCL
Deep Reinforcement Learning
• Computerised agent: learning what to do
– How to map situations (states) to actions so as to maximise a numerical reward signal
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
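To make the state-action-reward loop concrete, here is a minimal tabular Q-learning sketch; the 5-state chain environment and all hyperparameters are toy assumptions for illustration, not taken from the talk.

```python
import random

# Minimal tabular Q-learning: map states to actions so as to maximise reward.
# The 5-state chain world below is a hypothetical toy, not from the slides.
N_STATES, ACTIONS = 5, [0, 1]            # actions: 0 = left, 1 = right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1        # learning rate, discount, exploration

def step(s, a):
    """Move along the chain; reaching the right end pays reward 1 and ends."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy choice of action for the current state
        a = random.choice(ACTIONS) if random.random() < eps \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # move Q(s, a) toward the reward plus discounted future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})
```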
Human-level Control
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html#videos
>75% of human level
AlphaGo vs. the world's Go champion
Coulom, Rémi. "Whole-History Rating: A Bayesian Rating System for Players of Time-Varying Strength." Computers and Games. Springer Berlin Heidelberg, 2008. 113-124.
http://www.goratings.org/
Last year's rating list
https://deepmind.com/research/alphago/alphago-china/
What is next?
• All of the above are single AI units
• But true human intelligence embraces social and collective wisdom
– Collective efforts can solve problems otherwise unthinkable, e.g., the ESP game and crowdsourcing
• A next grand challenge of AI
– How can large-scale multiple AI agents learn human-level collaborations (or competitions) from their experiences?
Artificial Collective Intelligence
• Huge application space
– Trading robots gaming the stock markets
– Ad bidding agents competing with each other on online advertising exchanges
– E-commerce collaborative-filtering recommenders predicting user interests through the wisdom of the crowd
– Traffic control
– Self-driving cars
– Creativity learning (generating texts, images, music, poetry)
– …
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Controllable Environments in Deep Reinforcement Learning
• In a typical RL setting, the environment is unknown yet fixed.
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
Controllable Environments
• We consider an environment that is controllable and strategic
• A minimax game between the agent and the environment
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Framework figure: (1) the environment generator generates environments parameterised by θ; (2) each environment trains an agent until an optimal policy is obtained; (3) the agents operate in their corresponding environments; (4) the observed agent returns guide the generator update.]
Solution for Non-differentiable Transitions
The MDP acts as an adversarial environment minimising the agent's expected return Σ_{t=1..∞} γ^t r_t, so the objective function is formulated as

θ* = arg min_θ max_π E[ G | π; M_θ = ⟨S, A, P_θ, R, γ⟩ ]

This adversarial objective can be applied to design environments that expose the weaknesses of an agent and its policy-learning algorithms.
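A minimal sketch of this minimax loop under strong simplifying assumptions: a single scalar θ (a goal location) stands in for an environment, the inner maximisation is solved exactly, and the non-differentiable generator step uses a score-function (REINFORCE-style) update. None of these specifics come from the paper.

```python
import numpy as np

# Sketch of theta* = argmin_theta max_pi E[G | pi; M_theta]: a Gaussian
# generator proposes environments (goal locations theta); for each, the
# agent's best return is computed; returns then push the generator toward
# environments where even the optimal agent does poorly.
rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.05           # generator: theta ~ N(mu, sigma^2)

def best_agent_return(theta):
    """Inner max: the agent picks the reachable action in [-1, 1] closest to
    the goal at theta; its optimal return is minus the remaining distance."""
    actions = np.linspace(-1.0, 1.0, 21)
    return float(np.max(-np.abs(actions - theta)))

for it in range(200):
    thetas = mu + sigma * rng.standard_normal(8)                # 1. generate environments
    returns = np.array([best_agent_return(t) for t in thetas])  # 2-4. train & test agents
    adv = returns - returns.mean()                              # baseline-corrected returns
    grad_mu = np.mean(adv * (thetas - mu) / sigma**2)           # d/dmu of E[agent return]
    mu -= lr * grad_mu                                          # descend: minimise the return

print(f"adversarial goal location mu = {mu:.3f}")   # drifts out of the agents' reach
```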
Controllable Environments: An Example
• Maze
• Agent: tries to find an optimal strategy for finding the way out
• Environment: generates a maze that makes the way out difficult to find
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
Design Maze: Results
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Figure: mazes generated against different solvers: DFS, RHS, Optimal, DQN]
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Four Auctions
• "Open cry" auctions
1. English auctions
2. Dutch auctions
• "Sealed bid" auctions
3. 1st-price / "pay-your-bid" auctions
4. 2nd-price / Vickrey auctions
[Illustration: four bidders bidding $2, $3, $8, $5]
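As a toy illustration of the two sealed-bid formats (reusing the example bid values from the slide): the winner is the highest bidder in both, but the charged price differs.

```python
# Toy sealed-bid auction: same winner, different charged price.
def sealed_bid_outcome(bids):
    """bids: dict bidder -> bid. Returns (winner, 1st-price, 2nd-price charge)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else top
    return winner, top, second

bids = {"A": 2, "B": 3, "C": 8, "D": 5}
winner, pay_first, pay_second = sealed_bid_outcome(bids)
print(f"winner: {winner}")                                  # C, who bid $8
print(f"1st-price ('pay-your-bid') charge: ${pay_first}")   # $8
print(f"2nd-price (Vickrey) charge:        ${pay_second}")  # $5
```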
Auction Scheme
[Diagram: bidders hold private values v1…v4 and submit bids b1…b4; the auction determines the winner and the payments $$$]
Machine Bidding
[The same diagram, with the bids b1…b4 now placed by machines on the bidders' behalf]
Online Advertising + Artificial Intelligence
• Design learning algorithms to make the best match between advertisers and Internet users under economic constraints
• Online advertising has been transformed from a low-tech process into a highly optimised, mathematical, computer-centric (Wall Street-like) process
• Key directions: operations research; estimating CTR/AR; auction systems; machine learning algorithms; behavioural targeting; fighting spam (click fraud)
(User targeting dominates the context)
RTB Display Advertising Mechanism
• Buying ads via real-time bidding (RTB): 10B auctions per day
• The flow among user, page, ad exchange, and demand-side platform (acting for the advertiser):
0. Ad request
1. Bid request (user, page, context)
2. Bid response (ad, bid price)
3. Ad auction
4. Win notice (charged price)
5. Ad served (with tracking)
6. User feedback (click, conversion)
• A data management platform supplies user information: demography (e.g., male, 26, student) and segmentation (e.g., London, travelling)
• The whole loop completes in <100 ms
[Zhang et al. Optimal real-time bidding for display advertising. KDD 14]
Can we have a dynamic model?
Bidding in RTB as an RL Problem
• From the perspective of an advertiser with an ad budget, sequential bidding in RTB is a reinforcement learning (RL) problem: given bid request x_t, the agent sets bid price a_t; the environment returns the auction result and user response, then issues the next bid request x_{t+1}
• The goal is to maximise the user responses on the displayed ads
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
MDP Formulation of RTB
• Consider bidding in RTB as an episodic process, with [s] state, [a] action, [p] state transition, and [r] reward:
1. [s] state: auctions left T, budget left B_T, and the current bid request x_T
2. [a] action: the bid a
3. [p] state transition: the auction result; [r] reward: the user response
– The state then moves to T−1 auctions left with budget B_{T−1}, and so on down to 0 auctions left with budget B_0, after which the next episode begins
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
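A minimal sketch of one episode of this MDP. The click-rate predictor, the budget-pacing bid rule, and the uniform market price are hypothetical stand-ins for illustration; they are not the paper's method.

```python
import random

# One episode of the RTB MDP: state = (auctions left, budget left, bid request),
# action = bid price, transition = auction result, reward = user response.
random.seed(0)

def run_episode(T=1000, budget=100.0):
    clicks, b = 0, budget
    for t in range(T, 0, -1):                 # [s] auctions left: T, T-1, ..., 1
        x = random.random()                   # [s] bid request feature (toy)
        pctr = 0.1 * x                        # hypothetical predicted click rate
        a = min(b, 200.0 * pctr * b / t)      # [a] bid, paced by remaining budget
        market = random.uniform(0.0, 2.0)     # hypothetical highest competing bid
        if a >= market:                       # [p] auction result: win
            b -= market                       # second-price: pay the market price
            clicks += random.random() < pctr  # [r] user response (click)
        if b <= 0:
            break
    return clicks, budget - b

clicks, spend = run_episode()
print(f"clicks: {clicks}, spend: {spend:.1f}")
```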
Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
Generative Models
• Classic machine learning tasks (label prediction): recognition maps high-dimensional data to a low-dimensional output (e.g., MNIST images to the digits 0-9)
• Generation tasks (generating actual data): generation maps a low-dimensional representation to high-dimensional data
[Figure: (a) real MNIST images vs. (b) generated images]
Generative Adversarial Nets (GANs)
• A minimax game between a discriminator and a generator:
– The discriminator (D) tries to correctly distinguish the true data from the fake, model-generated data
– The generator (G) tries to generate high-quality data to fool the discriminator
• G and D can be implemented as neural networks
• Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution
[Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In NIPS 2014.]
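A minimal sketch of this minimax game in PyTorch, on 1-D synthetic data rather than images; the architectures and hyperparameters are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Minimal GAN: D separates true from generated samples, G learns to fool D.
torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 3.0 + 0.5 * torch.randn(64, 1)      # true data ~ N(3, 0.5^2)
    fake = G(torch.randn(64, 8))               # model-generated data
    # Discriminator: label true data 1 and (detached) fake data 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: make the updated D label its samples as true
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(torch.randn(1000, 8))
print(f"generated mean/std: {samples.mean():.2f} / {samples.std():.2f}")
```

At convergence the generated mean and standard deviation should approach the true 3.0 and 0.5, the point where D can no longer tell the two apart.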
Labeled Generative Adversarial Nets
• The discriminator is a multi-class classifier trained with labelled data
[Diagram: the generator G produces a sample; the discriminator D classifies it; G's loss is averaged from the predicted labels]
[Paper Figure 1: the problem of overlaid gradients in LabGAN (Salimans et al., 2016) on multi-mode real data: a generated sample between two real classes receives gradients toward both class centres, which blend in the final gradient for G]
From the paper's formulation (Eq. 8), α_k^lab(x) = D_k(x)/D_r(x) for k ∈ {1, …, K} and 1 for k = K+1, so the overall gradient w.r.t. a generated example x is (1 − D_r(x)); this is consistent with the original GAN (Goodfellow et al., 2014) when no label information is given, and the gradient on "real" is then further distributed to each real class logit.
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
Activation Maximisation Generative Adversarial Nets
• The discriminator is again a multi-class classifier trained with labelled data
[Same diagram as the previous slide, with the generator loss replaced by an activation-maximised loss]
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
GAN with Activation Maximisation
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017]
[Paper Figures 2-3: generated examples against the true density on synthetic data, measured by oracle NLL: LabGAN reaches NLL 17.86 / 17.11 / 16.71 at 50k / 150k / 200k iterations, while SAM-GAN reaches 17.66 / 15.94 / 15.79]
[Paper Figure 4: CIFAR-10 progress from 5,000 to 300,000 iterations; the Inception score rises to 8.34 and the AM score to 9.29]
[Paper Figure 5: real vs. generated MNIST images]
SeqGAN: Sequence Generation
• The generator is a reinforcement learning policy generating a sequence
– It decides the next word to generate (the action) given the previous ones (the state)
• The discriminator provides the reward (i.e., the probability of being true data) for the whole sequence
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
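A minimal REINFORCE sketch of this idea: a toy GRU generator samples a sequence token by token, and a stand-in discriminator's whole-sequence score is used as the reward. SeqGAN itself additionally uses Monte Carlo rollouts to assign rewards to intermediate steps; that part is omitted here.

```python
import torch
import torch.nn as nn

# Generator = policy: next token (action) given the prefix (state).
# The discriminator reward is given only for the whole finished sequence.
torch.manual_seed(0)
V, H, T = 10, 32, 8                        # vocab size, hidden size, seq length
emb, gru, head = nn.Embedding(V, H), nn.GRUCell(H, H), nn.Linear(H, V)
disc = nn.Sequential(nn.Linear(T, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt = torch.optim.Adam([*emb.parameters(), *gru.parameters(), *head.parameters()], lr=1e-3)

def generate():
    """Sample a sequence, keeping per-step log-probabilities for REINFORCE."""
    h, tok = torch.zeros(1, H), torch.zeros(1, dtype=torch.long)  # start token 0
    logps, toks = [], []
    for _ in range(T):
        h = gru(emb(tok), h)
        dist = torch.distributions.Categorical(logits=head(h))
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        toks.append(tok)
    return torch.stack(toks, dim=1).float(), torch.cat(logps)

seq, logps = generate()
reward = disc(seq).squeeze()              # P(sequence is true data)
loss = -(reward.detach() * logps.sum())   # policy gradient: reinforce good sequences
opt.zero_grad(); loss.backward(); opt.step()
print(f"discriminator reward: {reward.item():.3f}")
```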
Experiments on Synthetic Data
• Evaluation measured with an oracle
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
Experiments on Real-World Data
• Chinese poem generation: human-written and machine-generated poems side by side
南陌春风早,东邻去日斜。
紫陌追随日,青门相见时。
胡风不开花,四气多作雪。
山夜有雪寒,桂里逢客时。
此时人且饮,酒愁一节梦。
四面客归路,桂花开青竹。
Human | Machine
Obama Speech Text Generation
Machine-generated:
• i stood here today i have one and most important thing that not on violence throughout the horizon is OTHERS american fire and OTHERS but we need you are a strong source
• for this business leadership will remember now i can't afford to start with just the way our european support for the right thing to protect those american story from the world and
• i want to acknowledge you were going to be an outstanding job times for student medical education and warm the republicans who like my times if he said is that brought the
Human (Obama):
• When he was told of this extraordinary honor that he was the most trusted man in America
• But we also remember and celebrate the journalism that Walter practiced -- a standard of honesty and integrity and responsibility to which so many of you have committed your careers. It's a standard that's a little bit harder to find today
• I am honored to be here to pay tribute to the life and times of the man who chronicled our time.
Summary
• Learning to compete
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
AI Plays StarCraft
• One of the most difficult games for computers
• At least 10^1685 possible states (for reference, the game of Go has about 10^170 states)!
• How can large-scale multiple AI agents learn human-level collaborations, or competitions, from their experiences?
Bidirectionally-Coordinated Nets (BiCNet)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
• Unsupervised training, without human demonstrations or labelled data
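A minimal sketch of the bidirectional-coordination idea: per-agent observations pass through a bidirectional RNN laid out over a fixed agent ordering, so each agent's action logits depend on all teammates' hidden states. Sizes are arbitrary, and the paper's actor-critic training is omitted.

```python
import torch
import torch.nn as nn

# Per-agent features flow through a bidirectional GRU over the agent order,
# so every agent's action depends on the hidden states of all the others.
torch.manual_seed(0)
N_AGENTS, OBS, H, N_ACTIONS = 5, 12, 32, 4
obs_enc = nn.Linear(OBS, H)
birnn = nn.GRU(H, H, bidirectional=True, batch_first=True)
policy_head = nn.Linear(2 * H, N_ACTIONS)

obs = torch.randn(1, N_AGENTS, OBS)       # one team of 5 agents
x = torch.relu(obs_enc(obs))              # per-agent features: (1, 5, H)
coord, _ = birnn(x)                       # information flows both ways: (1, 5, 2H)
logits = policy_head(coord)               # per-agent action logits: (1, 5, N_ACTIONS)
print(logits.argmax(dim=-1))              # one coordinated action per agent
```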
Coordinated Moves Without Collision
Combat: 3 Marines (ours) vs. 1 Super Zergling (enemy)
• Panels (a) and (b): collisions happen when the agents are close by, during the early stage of training
• Panels (c) and (d): coordinated moves by the well-trained agents
[Paper Figure 2]
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
"Hit and Run" Tactics
Combat: 3 Marines (ours) vs. 1 Zealot (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 3: time step 1 (run when attacked), time step 2 (fight back when safe), time step 3 (run again), time step 4 (fight back again)]
From the paper: unlike CommNet [20], the communication is not fully symmetric; certain social conventions and roles are maintained by fixing the order in which the agents join the RNN, which helps resolve ties between multiple optimal joint actions [35, 36].
Coordinated Cover Attack
Combat: 3 Marines (ours) vs. 1 Zergling (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 4: coordinated cover attack over four time steps]
[Paper Table 1: winning rate against difficulty settings by hit points (HP) and damage, at 100k/200k/300k training steps]
Focus Fire
Combat: 15 Marines (ours) vs. 16 Marines (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 5: "focus fire" over four time steps]
Coordinated Heterogeneous Agents
Combat: 2 Dropships and 2 Tanks vs. 1 Ultralisk
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Paper Figure 6: coordinated heterogeneous agents, with attack, load, and unload actions over two time steps]
From the paper: neither scattering over all enemies nor focusing on one enemy (wasting attacking fire, also called overkill) is desired; the grouping design in the policy network lets BiCNet learn "focus fire without overkill", with agents grouped dynamically by their geometric locations.
AI Playing StarCraft: Demo
In collaboration with the Alibaba Group
Building a Persona: Freud's Model of the Human Mind
• The id is the primitive and instinctual part of the mind that contains sexual and aggressive drives and hidden memories
• The super-ego operates as a moral conscience
• The ego is the realistic part that mediates between the desires of the id and the super-ego
https://en.wikipedia.org
(Figure labels: 意识, consciousness; 潜意识, the subconscious)
Reinforcement Learning with 1 Million Agents
[Figure 2: Million-agent Q-learning in a Predator-Prey World. Each agent feeds its (observation, ID) pair through a shared Q-network, with the ID passed through an embedding, to obtain Q-values and pick an action; rewards come back from the environment, and the transitions (s_t, a_t, r_t, s_{t+1}) are stored in a shared experience buffer that updates the Q-network.]
Yaodong Yang et al. An Empirical Study of Collective Behaviors in Many-agent Reinforcement Learning. Submitted, 2017.
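A minimal sketch of the shared Q-network in Figure 2: every agent queries the same network on its (observation, ID) pair, with the ID mapped through a trainable embedding. Sizes and the surrounding training loop are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# One Q-network shared by all agents; the agent ID enters via an embedding,
# so agents can develop individual behaviour without one network per agent.
torch.manual_seed(0)
N_AGENTS, OBS, EMB, H, N_ACTIONS = 1_000_000, 16, 8, 64, 4
id_emb = nn.Embedding(N_AGENTS, EMB)       # one trainable vector per agent ID
q_net = nn.Sequential(nn.Linear(OBS + EMB, H), nn.ReLU(), nn.Linear(H, N_ACTIONS))

def q_values(obs, agent_ids):
    """Q(obs, ID) for a batch of agents sharing the same network."""
    return q_net(torch.cat([obs, id_emb(agent_ids)], dim=-1))

obs = torch.randn(3, OBS)                  # observations for three agents
ids = torch.tensor([7, 42, 999_999])       # three of the one million IDs
print(q_values(obs, ids).argmax(dim=-1))   # greedy action for each agent
```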
Artificial Population vs. Real Population
Yaodong Yang et al. An Empirical Study of Collective Behaviors in Many-agent Reinforcement Learning. Submitted, 2017.
Thanks for your attention
http://www.thisisbarry.com/single-post/2015/12/28/The-Thirteenth-Floor-1999-Explained