The Over-the-Network Problem
M.Zhanikeev -- maratishe@gmail.com -- A New Practical Design for Browsable Over-the-Network Indexing -- http://bit.do/140428 -- 2/18
Over-the-Network Problem
[Figure: "Traditional" (Data, Indexer, Index, Network, Client) versus "Stringex" (Data, Indexer, Index with Read/Write, Client)]
Everything is Over-the-Network
• ... in clouds
• ... inside data centers
• ... in home networks
When running over-the-network, the biggest problem is that there is a hard physical limit to throughput.
The "Best" Tools Today
The Closest Tools
1. Lucene, which runs locally only
2. Google Data APIs, which allow for shared control
◦ not really indexing, though
3. ... that's pretty much it!
Target Applications
Target Applications
[Figure: Data, Indexer, Index, Stringex Client]
• server-less applications (read: fully distributed)
• large-scale crowdsourcing connected via cloud storage
• distributed storage -- the same problem
• ...
The Stringex Problem
The Stringex Problem
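• a very straightforward optimization problem
\begin{align}
\text{minimize}\quad & w_1 R_{\mathrm{OUT}} + w_2 R_{\mathrm{IN}} \tag{1}\\
\text{subject to}\quad & \tag{2}\\
& 0 < R_{\mathrm{IN}} \le R_{\mathrm{OUT}} \le C, \tag{3}\\
& S_{\mathrm{LOCAL}} \le M \le S_{\mathrm{REMOTE}}, \tag{4}\\
& N_{\mathrm{LOCAL}} \le N_{\mathrm{REMOTE}} \le N_{\mathrm{USER}}, \tag{5}
\end{align}
• R is rate, throughput, etc.
• S is storage size, can be local and remote
• C and M are constants, set by user
• N is number of files over which the index is split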
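To make the formulation concrete, here is a minimal sketch that checks constraints (3)-(5) for a few candidate operating points and picks the cheapest feasible one under objective (1). The names `Plan`, `is_feasible`, `cost`, and `best_plan` are hypothetical, not part of Stringex. The sketch also assumes that w1 and w2 are user-chosen weights and that N_USER is supplied by the user, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """One candidate operating point (illustrative field names)."""
    r_out: float     # R_OUT
    r_in: float      # R_IN
    s_local: float   # S_LOCAL
    s_remote: float  # S_REMOTE
    n_local: int     # N_LOCAL
    n_remote: int    # N_REMOTE


def is_feasible(p, c, m, n_user):
    """Check constraints (3), (4) and (5) from the slide."""
    return (0 < p.r_in <= p.r_out <= c
            and p.s_local <= m <= p.s_remote
            and p.n_local <= p.n_remote <= n_user)


def cost(p, w1, w2):
    """Objective (1): weighted sum of the two rates."""
    return w1 * p.r_out + w2 * p.r_in


def best_plan(plans, w1, w2, c, m, n_user):
    """Return the feasible plan with the lowest cost, or None if none is feasible."""
    feasible = [p for p in plans if is_feasible(p, c, m, n_user)]
    return min(feasible, key=lambda p: cost(p, w1, w2), default=None)
```

An exhaustive scan like this is only practical for a handful of candidates; it is meant to show the shape of the problem, not a solver.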
Naive Stringex Client
Practical Assumptions
• JSON input, only top level is indexed, otherwise stringified
• several efficiency tricks
1. split the index into relatively small files
2. distribute smoothly using random hashing
3. update parts on timeout -- accumulate multiple intensive updates
4. create special maps which allow for browsing
• JSON aggregations in files: one line is base64( JSON string )
◦ if the bzip2 algorithm is within reach, you can have base64( bzip2( JSON string ))
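The line format above can be sketched with the Python standard library. This is an illustration under stated assumptions rather than the Stringex implementation: `flatten_top_level`, `encode_line`, and `decode_line` are hypothetical names, and the choice to stringify nested values with `json.dumps` is one possible reading of "otherwise stringified".

```python
import base64
import bz2
import json


def flatten_top_level(doc):
    """Keep top-level keys; stringify any nested value (assumed behavior)."""
    return {
        k: v if isinstance(v, (str, int, float, bool)) or v is None
        else json.dumps(v, sort_keys=True)
        for k, v in doc.items()
    }


def encode_line(record, compress=True):
    """Return base64(bzip2(JSON string)), or base64(JSON string) if compress is False."""
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    if compress:
        raw = bz2.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_line(line, compress=True):
    """Inverse of encode_line; compress must match the flag used when encoding."""
    raw = base64.b64decode(line)
    if compress:
        raw = bz2.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Because base64 output contains no newline characters, each encoded record can occupy exactly one line of a file.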
Naive Client: Data Structure
INPUT JSON { name : value1, age : value2, … }

Index files (one set per key: Key: name, Key: age, …)
name.imap
{
  bk : {
    ik : start,end ,
    … next ik
  },
  … next bk
}
name.vmap
{
  value : bk ,
  … next value
}
name.bk1
name.bk2
…

Data files (Docs)
docs.imap
{
  bk : {
    docid : start,end ,
    … next docid
  },
  … next bk
}
docs.bk1
docs.bk2
…
(no .vmap for docs; the .imap and .bk files are the same as for the index)

• meta is separate from data
• smart maps let you read/write sections of files
◦ specifically for the chunk* API in Dropbox
• filenames are the first 2-3 symbols of MD5 hashes
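The file layout above can be sketched in memory as follows. This is a toy model under several assumptions the slide does not confirm: `bk` is taken to be a bucket file identifier derived from the MD5 prefix of the value, `ik` is treated as an opaque entry key inside a bucket, and `start,end` are byte offsets with an exclusive end. The names `bucket_name` and `KeyIndex` are hypothetical, and a `bytearray` stands in for each remote bucket file.

```python
import hashlib


def bucket_name(value, prefix_len=2):
    """Head symbols of the MD5 hex digest of a value (2-3 per the slide)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:prefix_len]


class KeyIndex:
    """In-memory stand-in for one key's .imap, .vmap and bucket files."""

    def __init__(self, prefix_len=2):
        self.prefix_len = prefix_len
        self.imap = {}     # bk -> {ik: (start, end)} byte ranges inside the bucket
        self.vmap = {}     # value -> bk
        self.buckets = {}  # bk -> bytearray standing in for a remote file

    def put(self, value, ik, line):
        """Append one line to the value's bucket and record its byte range."""
        bk = bucket_name(value, self.prefix_len)
        self.vmap[value] = bk
        bucket = self.buckets.setdefault(bk, bytearray())
        data = line.encode("utf-8")
        start = len(bucket)
        bucket.extend(data + b"\n")
        self.imap.setdefault(bk, {})[ik] = (start, start + len(data))

    def get(self, value, ik):
        """Read back only the recorded section of the bucket."""
        bk = self.vmap[value]
        start, end = self.imap[bk][ik]
        return bytes(self.buckets[bk][start:end]).decode("utf-8")
```

The point of keeping byte ranges in the map is that a client holding only the small map files can request a section of a bucket instead of downloading the whole file.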
Naive Client: Sync Engine Design
[Figure: sync engine design. Labels: Stringex Index, Stringex Client, Sync Engine, Optimization, Local Cache; steps 1 and 2 labeled Check and Use]
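The slide gives only diagram labels, so the following is one possible reading rather than the actual engine: a local buffer that accumulates updates per index part and pushes them out once a timeout has passed, in the spirit of the "update parts on timeout" trick listed under Practical Assumptions. `UpdateBuffer`, its `flush` callback, and the injectable `clock` are all hypothetical.

```python
import time


class UpdateBuffer:
    """Accumulate updates per part locally; flush a part once it is old enough."""

    def __init__(self, flush, timeout=5.0, clock=time.monotonic):
        self._flush = flush      # callable(part, lines) that writes to remote storage
        self._timeout = timeout  # seconds to wait after the first pending update
        self._clock = clock      # injectable so the behavior can be tested
        self._pending = {}       # part -> list of pending lines
        self._first_seen = {}    # part -> time of its first unflushed update

    def add(self, part, line):
        """Queue one update; no network traffic happens here."""
        self._pending.setdefault(part, []).append(line)
        self._first_seen.setdefault(part, self._clock())

    def tick(self):
        """Flush every part whose oldest pending update has timed out."""
        now = self._clock()
        due = [p for p, t in self._first_seen.items()
               if now - t >= self._timeout]
        for part in due:
            self._flush(part, self._pending.pop(part))
            del self._first_seen[part]
        return due
```

Calling `tick()` periodically turns a burst of small updates to the same part into a single remote write, which matters when throughput is the limiting resource.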
Evaluation
Stringex vs Lucene
[Figure: throughput (log of bytes/doc) versus index size (log), for Lucene and Stringex]
Wrapup
• https://github.com/maratishe/stringex has a JS client
• I also have a PHP client for command-line Stringex
• Stringex is better than Lucene for browsing because items cluster naturally
◦ I use it for small browsable summaries of datasets
◦ ... and context-based browsable datasets
• many other uses are possible
That’s all, thank you ...