1. Copyright (C) 2018 DeNA Co.,Ltd. All Rights Reserved.
Reinforcement Learning For
Taxi Rebalancing
Reinforcement Learning Architecture Study Group
January 2020
Takuma Oda
Mobility Intelligence Development Dept.
Automotive Business Unit
DeNA Co., Ltd.
2.
MOVI: A Model-Free Approach to Dynamic Fleet
Management
Takuma Oda and Carlee Joe-Wong; IEEE INFOCOM 2018
3.
Objective
Optimize the repositioning of vacant vehicles in a ride-hailing app
=> Minimize user waiting time, maximize the dispatch success rate, maximize vehicle utilization
Motivation
⁃ Better UX
⁃ Higher driver productivity
⁃ Greater platform revenue and competitiveness
Taxi rebalancing in ride-hailing service
4.
Fleet management
server
Observations
・Past ride requests
・Vehicle states
Control signals
・Destination or route
Problem Definition
Assumptions
⁃ Demand comes only from in-app requests and is fully observable, whether or not a dispatch succeeds
⁃ The driver acceptance rate is 100%; a request fails if no vehicle is within a fixed dispatchable distance
⁃ All vehicle information is observable in real time
⁃ Vacant-vehicle actions (dispatch) are fully controllable; the passenger-vehicle matching logic is not
5.
RHC: Receding Horizon Control (= MPC: Model Predictive Control)
⁃ Models the system dynamics and, at every step, computes a sequence of control inputs that optimizes an objective over a finite horizon
⁃ Only the control for the current time slot is actually executed
⁃ Modeling errors arise (demand/supply uncertainty, vehicle dynamics)
⁃ Responsiveness (on-demand performance) is low, since an optimization problem must be solved at every step
Model-free Reinforcement Learning
⁃ Learns by interacting with the environment, without modeling the system dynamics
⁃ Highly responsive: the problem decomposes easily into per-vehicle problems, and inference is pure matrix computation
Approach
6.
RHC formulation
We want the number of vehicles (u) to move from i to j at each step
The number of controlled vehicles is bounded by the number of vehicles present in i at step t
(Equation labels: reject count; vehicle movement cost)
(Objective function at each step (= reward); dynamics model of the vehicle count x)
Demand and travel-time forecasts use regression models trained on historical data
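The skeleton of this step can be sketched in Python. The following is a greedy one-step stand-in for the paper's linear program, not the actual solver: the region indices, the `beta` cost weight, and the rule that skips moves costing more than one avoided reject are all illustrative assumptions.

```python
# Toy one-step rebalancing in the spirit of the RHC objective: minimize
# (reject count) + beta * (movement cost), moving at most x_i vehicles
# out of each region i. The paper solves an LP over a T-step horizon;
# this greedy stand-in handles a single step for illustration.
def rebalance(supply, demand, travel_time, beta=0.1):
    """Return moves[(i, j)] = vehicles to send from region i to region j."""
    surplus = {i: supply[i] - demand[i] for i in supply if supply[i] > demand[i]}
    deficit = {j: demand[j] - supply[j] for j in supply if demand[j] > supply[j]}
    moves = {}
    # Fill the largest deficits first, using the cheapest surplus regions.
    for j in sorted(deficit, key=deficit.get, reverse=True):
        need = deficit[j]
        for i in sorted(surplus, key=lambda i: travel_time[i][j]):
            if need == 0:
                break
            # Skip moves whose cost outweighs one avoided reject.
            if beta * travel_time[i][j] >= 1.0:
                continue
            k = min(need, surplus[i])
            if k > 0:
                moves[(i, j)] = moves.get((i, j), 0) + k
                surplus[i] -= k
                need -= k
    return moves
```

The constraint that outflows from i never exceed the vehicles currently in i is enforced by drawing only from `surplus[i]`.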
7.
Model-free RL formulation
Agent ⇄ Environment:
action a, state s, reward r
Action: destination for the next step (discretized on a grid)
State: demand forecast, number of vacant vehicles, etc.
Reward: pickup indicator − cruising cost
Redefining the task as independent per-taxi problems (IQL: Independent Q-learning) drastically shrinks the state and action spaces
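A minimal sketch of the IQL decomposition, assuming toy integer region indices for states and a small discrete destination set; every vehicle writes into one shared tabular Q (the paper uses a deep network instead):

```python
# Independent Q-learning sketch: all vehicles share one Q-table but each
# updates it from its own (s, a, r, s') transition, which is how the
# per-vehicle decomposition keeps the state/action space small.
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)], zero-initialized
ALPHA, GAMMA = 0.1, 0.98        # assumed learning rate and discount
ACTIONS = range(4)              # toy set of candidate destination cells

def td_update(s, a, r, s_next):
    """One Q-learning backup, shared by all vehicles."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```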
8.
RL Algorithm
Minimize the Bellman error
The experience memory accumulates transitions from all vehicles
Inference uses a single model shared by all vehicles
State heatmaps are updated periodically
(ε-greedy exploration during training)
The simulator reproduces the shortest-path trajectory to the destination (action)
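The replay and exploration machinery on this slide can be sketched as follows; the buffer size and state encoding are placeholder assumptions, and the Q-values are supplied by the caller rather than a real network:

```python
# Shared-replay setup: transitions from every vehicle go into one buffer,
# and a single policy serves all vehicles. Epsilon-greedy exploration
# as on the slide.
import random
from collections import deque

memory = deque(maxlen=100_000)   # experiences from all vehicles pooled

def act(q_values, epsilon):
    """Epsilon-greedy action over a list of Q-values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def store(state, action, reward, next_state):
    memory.append((state, action, reward, next_state))

def sample_batch(batch_size):
    """Uniformly sample a minibatch for the Bellman-error update."""
    return random.sample(list(memory), min(batch_size, len(memory)))
```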
9.
Q-network
Learn a network that, for a given state, outputs action values over the action space centered on the current position
Inputs include demand and supply heatmaps around the current position
Spatially averaged features are added in advance so that information from distant areas can propagate
Action-value heatmap
Demand forecast heatmap
Vacant-vehicle count heatmap
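A small helper in the spirit of the slide's spatially averaged inputs: it downsamples a heatmap by block-averaging, which is one plausible way (an assumption, not the paper's exact preprocessing) to build the coarse feature maps that carry far-away information into the network:

```python
# Build a coarse copy of a heatmap by averaging non-overlapping k x k
# blocks. Feeding both the fine and the pooled map to the Q-network lets
# distant demand influence the local action-value heatmap.
def average_pool(grid, k):
    """Downsample a 2-D list of numbers by k x k block averaging."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for bi in range(0, h, k):
        row = []
        for bj in range(0, w, k):
            block = [grid[i][j]
                     for i in range(bi, min(bi + k, h))
                     for j in range(bj, min(bj + k, w))]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled
```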
10.
Simulation Architecture (diagram)
Agent: Demand Prediction, RHC/DQN Policies
Environment Simulator: Fleet Object, Dispatcher (Matching, Dispatch), ETA Model, OSM Road Network (Route, Trip Time), Ride Requests
Exchanged signals: w_{t-1}, F_t, a_t
11.
Experiment
Trained and evaluated in NYC with 8,000 vehicles
Compared a no-repositioning baseline (NO) with RHC and DQN
DQN outperforms RHC
With the inference delay made equal (DQN*), results are comparable to RHC
Per-vehicle performance variance is lower for DQN, preserving fairness across drivers
(Metrics: reject rate, waiting time, idle cruising time, mean utilization, minimum utilization)
12.
Distributed Fleet Control with Maximum Entropy Deep
Reinforcement Learning
Takuma Oda and Yulia Tachibana; NeurIPS 2018 Workshop
13.
Distributed Diffusion Control Network
Maximum Entropy RL (Soft Q-learning) & Stochastic Policy
Graph Diffusion Convolution
14.
Maximum Entropy RL
Haarnoja, T., Tang, H., Abbeel, P. & Levine, S. Reinforcement Learning with Deep Energy-
Based Policies. In Proceedings of the 34th International Conference on Machine Learning
(ICML) 70, 2017.
Objective function
⁃ Maximizing the entropy of the whole trajectory (keeping several candidate actions alive until the agent is confident which one is optimal) strengthens robustness to disturbances and uncertainty
⁃ For example, when rewards are time-dependent or uncertain: if, on the way to a location, it turns out that (expected) demand there is low or that competing vehicles are converging on it, the agent can pivot as long as another option remains
Soft Q-learning
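Soft Q-learning's key substitutions can be written out directly: the hard max in the Bellman backup becomes a log-sum-exp soft maximum, and actions are sampled from the induced Boltzmann policy. A minimal sketch, where the temperature `alpha` defaulting to 1.0 is an arbitrary choice:

```python
# Soft maximum and Boltzmann policy used by soft Q-learning: several
# destinations stay plausible until their Q-values clearly separate.
import math

def soft_value(q_values, alpha=1.0):
    """V(s) = alpha * log sum_a exp(Q(s,a)/alpha) (soft maximum)."""
    m = max(q_values)                      # shift for numerical stability
    return m + alpha * math.log(sum(math.exp((q - m) / alpha)
                                    for q in q_values))

def soft_policy(q_values, alpha=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/alpha)."""
    m = max(q_values)
    w = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(w)
    return [x / z for x in w]
```

As `alpha` shrinks, `soft_value` approaches the hard max and the policy becomes greedy, recovering ordinary Q-learning.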
15.
Diffusion Convolution
Li, Y., Yu, R., Shahabi, C. & Liu, Y. Diffusion Convolutional Recurrent Neural
Network: Data-Driven Traffic Forecasting. In ICLR, 2018.
Features discretized on a grid do not reflect the road network structure, so movement costs cannot be estimated correctly
Diffusion Convolution propagates features by network distance instead
road1 is close to road2 and road3 in Euclidean distance, but its speed correlates weakly with road3, which is the opposite lane
16.
Diffusion Convolution
The filter is the road-network travel time from the center cell
If the diffusion operation is applied when the state (demand/supply) information is updated, it is not needed at inference time
Because the features are matrices, computation is fast while still incorporating road-network weights
Filter Examples
Diffused Feature Map
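One way to picture this precomputation, as a simplified sketch: weight every cell's feature by a decaying function of network travel time from the center. The exponential kernel and the `tau` scale are assumptions of this illustration; the paper follows Li et al.'s diffusion convolution rather than this exact form.

```python
# One diffusion step: each cell's feature becomes a travel-time-weighted
# average over all cells, so information propagates along the road
# network rather than by Euclidean distance. Done once per state update,
# so inference needs no extra work.
import math

def diffuse(features, travel_time, tau=5.0):
    """features[i]: scalar per cell; travel_time[i][j]: network minutes."""
    n = len(features)
    out = []
    for i in range(n):
        w = [math.exp(-travel_time[i][j] / tau) for j in range(n)]
        z = sum(w)
        out.append(sum(w[j] * features[j] for j in range(n)) / z)
    return out
```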
17.
Simulation Architecture
18.
Experiment
Comparing diffusion with a travel-time filter against a uniform filter, the travel-time filter scores higher overall
Soft Q-learning reduces waiting time markedly compared with hard Q-learning
Speaker notes
In traditional taxi networks, individual drivers look for passengers hailing on the street, relying on their experience and knowledge
But this can be inefficient if they do not know future demand and are not coordinated
For instance, suppose two vacant taxis cruise or are dispatched to certain regions while customers request rides at other locations. In this case the dispatch decisions were not optimal for either the customers or the drivers: one of the drivers has to spend a lot of time cruising
Modern ride-hailing fleet networks such as Uber and Lyft can track vehicles’ GPS location and passengers’ pickup location in real time.
This data can be utilized to predict passenger demand and vehicle mobility patterns in the future, which enables proactive dispatch of their vehicles to predicted future pickup locations
In this way, optimizing taxi dispatch can reduce passengers' waiting time for a ride and increase drivers' revenue
Let me define the problem more precisely
We assume an environment and an agent. The environment consists of vehicles and passengers with a mobile app
The agent takes an action by dispatching; by dispatch, we mean sending a vacant taxi to another location
The agent observes each vehicle's location and availability status and all passengers' pickup requests. Using this real-time information, the agent determines proactive dispatching for vacant vehicles.
Since we focus on optimizing proactive dispatching, we incorporate the matching algorithm between passengers and available vehicles in the environment.
The agent's goal is to optimize sequential dispatch decisions so as to maximize accumulated reward
The action variable for the baseline is the number of vehicles to send to each region, each time, denoted by u_t
We wish to choose the u_t to maximize reward, defined by a weighted sum of the number of rejects and the vehicles’ idle cruising time
The number of vehicles in the next time slot is computed by this transition model.
The first term corresponds to vehicles left over after pickups
The second term is the net number of vehicles dispatched to this region
The last two terms represent occupied vehicles dropping off passengers within time slot t+1
Assuming the future demand is known, we can find optimal dispatch actions that maximize accumulated reward over a T-step horizon
Every time step, we solve RHC to determine next T step actions, but execute only current action.
The first constraint ensures that the total number of vehicles dispatched from i-th region must not exceed the number of idle vehicles
The second constraint ensures that we do not dispatch vehicles to regions whose travel times exceed d_t, so that all dispatch movements complete within a time interval
For simplicity, we assume that the u_tij are continuous variables; we can then solve the optimization problem efficiently with Linear Programming methods
The action variable is where each taxi should go in the next timeslot
Similar to the baseline, we express the reward function for each vehicle as a weighted sum of pickup reward and idle cruising cost
We would like to learn optimal action-value function, which is defined as the maximum expected return achievable by any policy
Since the state space is huge, we use a neural network function approximator for Q
For the loss function we use MSE, and the target value is computed by a Bellman backup of the current estimate
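This loss can be written out in a pure-Python sketch; the `q`/`q_target` callables stand in for the online and target networks, an assumption of this illustration since the notes mention only the current estimate:

```python
# MSE between Q(s, a) and the bootstrapped Bellman target
# r + gamma * max_a' Q_target(s', a'), averaged over one minibatch.
def dqn_loss(batch, q, q_target, gamma=0.98):
    """batch: list of (s, a, r, s_next); q/q_target: fn(state) -> list of Q."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target(s_next))   # Bellman backup
        total += (q(s)[a] - target) ** 2
    return total / len(batch)
```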
To evaluate RHC and DQN policies, we design and implement MOVI as a taxi fleet simulator
This diagram shows the MOVI architecture
Fleet object simulates states of all vehicles
In every time step, MOVI generates ride requests based on real trip records and matches each request to a vehicle with a nearest-neighbor algorithm
Next, the agent observes the current state of the environment, which includes vehicle and request information
The agent then computes the actions, using either RHC or DQN policy, and sends a dispatch order to idle vehicles
For each dispatch order, MOVI creates an estimated trajectory to the dispatched location by computing the shortest path in OSM road network graph
Finally, all vehicles update their states according to their matching and dispatch assignments
The dispatch policy is a separate module that does not affect the other simulator modules, so we can compare different dispatch policies under the same settings
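The nearest-neighbor matching step mentioned above can be sketched like this; Euclidean distance and the `max_dist` cutoff are stand-ins for the simulator's road-network ETA and fixed dispatchable-distance constraint:

```python
# Nearest-neighbour matching as in the MOVI simulator notes: each request
# is assigned to the closest idle vehicle within a maximum pickup
# distance; requests left unassigned count as rejects in the reward.
def match(requests, vehicles, max_dist=2.0):
    """requests/vehicles: {id: (x, y)}. Returns {request_id: vehicle_id}."""
    idle = dict(vehicles)
    assignment = {}
    for rid, (rx, ry) in requests.items():
        best, best_d = None, max_dist
        for vid, (vx, vy) in idle.items():
            d = ((rx - vx) ** 2 + (ry - vy) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = vid, d
        if best is not None:
            assignment[rid] = best
            del idle[best]          # a vehicle serves at most one request
    return assignment
```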