There are a lot of solutions for querying JSON data available, most of which are proprietary and come with a steep learning curve. Couchbase's N1QL (Non-First Normal Form Query Language) is a very powerful query language built on top of the SQL we all know and love (well, mostly love). It's really amazing how easy N1QL is for current SQL users.
In this session, we'll delve into the differences between SQL and N1QL, learning how N1QL layers new features on top of ANSI SQL to support nested data and JSON types. We'll also go in depth into indexing JSON data using Couchbase, covering how to design and troubleshoot your indexes to drive spectacular performance at scale.
2. Who is this guy?
• Brant Burnett - @btburnett3
• Systems Architect at CenterEdge Software
• .NET since 1.0, SQL Server since 7.0
• MCSD, MCDBA
• Experience from desktop apps to large-scale cloud services
3. NoSQL Credentials
• Couchbase user since 2012 (v1.8)
• Couchbase Community Expert
• Open source contributions:
• Couchbase .NET SDK
• Couchbase.Extensions for .NET Core
• Couchbase LINQ provider (Linq2Couchbase)
• CouchbaseFakeIt
• couchbase-index-manager
5. What is Couchbase
• NoSQL document database
• Get and set documents by key
• Imagine a giant folder full of JSON files
• If you know the filename, you can get or
update the content
• Additional features:
• Query using N1QL (SQL-based)
• Map-Reduce Views
• Full Text Search
• Analytics (Preview in 5.5)
• Eventing (5.5)
• Couchbase is not CouchDB
9. What’s a Bucket?
• Large collection of JSON
documents
• Every document may have a
different schema
• Documents are accessed by a
string called the key
10. CustomerID Name DOB
CBL2015 Jane Smith 1990-01-30
Table: Customer
{
"Name": "Jane Smith",
"DOB": "1990-01-30",
"type": "customer"
}
Document Key: customer-CBL2015
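The row-to-document mapping on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `row_to_document` is hypothetical, and the `customer-` key prefix is a convention inferred from the example key, not something Couchbase requires.

```python
def row_to_document(row, doc_type="customer"):
    """Turn a relational row (a dict) into a (key, document) pair."""
    # The primary key column moves out of the body and into the document key.
    key = f"{doc_type}-{row['CustomerID']}"
    # Every other column becomes a JSON attribute.
    document = {name: value for name, value in row.items() if name != "CustomerID"}
    # A "type" attribute records what kind of document this is.
    document["type"] = doc_type
    return key, document


row = {"CustomerID": "CBL2015", "Name": "Jane Smith", "DOB": "1990-01-30"}
key, document = row_to_document(row)
print(key)
print(document)
```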
11. So how do I query data from a bucket?
12. {
"Name": "John Smith",
"DOB": "1990-06-29",
"type": "customer"
}
Document Key: customer-CBL2016
SELECT Name, DOB FROM Bucket
WHERE type = 'customer' AND Name LIKE '%Smith'
ORDER BY Name
[{
"Name": "Jane Smith",
"DOB": "1990-01-30"
},
{
"Name": "John Smith",
"DOB": "1990-06-29"
}]
{
"Name": "Jane Smith",
"DOB": "1990-01-30",
"type": "customer"
}
Document Key: customer-CBL2015
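To make the query on this slide concrete, here is a rough Python emulation of what it computes over an in-memory "bucket". It is a sketch of the semantics only, not of how Couchbase executes the query; the `query_customers` name and the `order-1` document are made up for the example.

```python
bucket = {
    "customer-CBL2015": {"Name": "Jane Smith", "DOB": "1990-01-30", "type": "customer"},
    "customer-CBL2016": {"Name": "John Smith", "DOB": "1990-06-29", "type": "customer"},
    "order-1": {"total": 10, "type": "order"},  # hypothetical non-customer document
}


def query_customers(bucket):
    # WHERE type = 'customer' AND Name LIKE '%Smith'
    matches = [
        doc for doc in bucket.values()
        if doc.get("type") == "customer" and doc.get("Name", "").endswith("Smith")
    ]
    # ORDER BY Name
    matches.sort(key=lambda doc: doc["Name"])
    # SELECT Name, DOB
    return [{"Name": doc["Name"], "DOB": doc["DOB"]} for doc in matches]


results = query_customers(bucket)
print(results)
```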
13. What other SQL features are
supported?
• Aggregation (MIN, MAX, SUM, AVG, COUNT, etc)
• GROUP BY/HAVING
• OFFSET/LIMIT
• Subqueries
• UNION/INTERSECT/EXCEPT
• Joins (more details to come…)
• UPDATE/INSERT/DELETE/UPSERT
14. Accessing Nested Objects
Key: airport_3484
{
"airportname": "Los Angeles Intl",
"city": "Los Angeles",
"country": "United States",
"faa": "LAX",
"geo": {
"alt": 126,
"lat": 33.942536,
"lon": -118.408075
},
"icao": "KLAX",
"id": 3484,
"type": "airport",
"tz": "America/Los_Angeles"
}
SELECT *
FROM `travel-sample`
WHERE type = 'airport' AND geo.alt < 1000
SELECT *
FROM `travel-sample`
WHERE type = 'route'
AND schedule[0].day = 1
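The dot and square-bracket paths used in these queries behave much like walking a parsed JSON document. A minimal Python sketch, assuming a hypothetical `resolve` helper and a `MISSING` sentinel standing in for N1QL's MISSING value (the route document below is made up, and negative array indexes are not handled):

```python
MISSING = object()  # hypothetical stand-in for N1QL's MISSING


def resolve(doc, path):
    """Follow a list of attribute names and array indexes into a document."""
    current = doc
    for step in path:
        if isinstance(step, int):
            # Array access by position, e.g. schedule[0]
            if not isinstance(current, list) or not (0 <= step < len(current)):
                return MISSING
            current = current[step]
        else:
            # Attribute access by name, e.g. geo.alt
            if not isinstance(current, dict) or step not in current:
                return MISSING
            current = current[step]
    return current


airport = {
    "faa": "LAX",
    "geo": {"alt": 126, "lat": 33.942536, "lon": -118.408075},
    "type": "airport",
}
route = {"type": "route", "schedule": [{"day": 1, "utc": "10:13:00"}]}

print(resolve(airport, ["geo", "alt"]))          # geo.alt
print(resolve(route, ["schedule", 0, "day"]))    # schedule[0].day
```

A path that does not exist resolves to `MISSING` rather than raising an error, which is why a predicate such as `geo.alt < 1000` simply does not match documents without a `geo` object.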
15. O backtick, backtick!
Wherefore art thou a
backtick?
• ANSI SQL delimits identifiers with double quotes
• SELECT * FROM "table-name"
• T-SQL also delimits identifiers with square
brackets
• SELECT * FROM [table-name]
• Both of these are used in JSON!
• {"array": ["string1", "string2"]}
• So, N1QL uses the backtick instead
• SELECT * FROM `bucket-name`
17. Strings
Supported by JSON
Collation is always
case sensitive
1
Literals are delimited with
either double or single
quotes
x = 'my string here'
x = "my string here"
2
Various supporting
functions
•String concatenation (||)
•LENGTH
•LOWER
•CONTAINS
•TRIM, etc…
3
18. Numbers
Supported by JSON
1
Literals are included
without delimiters
x = 123456.05124
2
Various supporting
functions
• Arithmetic operators
• ABS
• CEIL
• SQRT
• TRUNC, etc...
3
20. Arrays
Supported by JSON
1
Literals are comma
delimited and surrounded
by square brackets
[1, 2, 3, "a"]
2
Various supporting
functions
• subqueries
• ARRAY_CONTAINS
• ARRAY_AVG
• ARRAY_INSERT
• ARRAY_LENGTH, etc…
3
21. Objects
Supported by JSON
1
Literals are comma
delimited key/value pairs
surrounded by curly
braces
{"key": "value"}
2
Various supporting
functions
• OBJECT_NAMES
• OBJECT_PAIRS
• OBJECT_VALUES, etc…
3
22. Nulls
Supported by JSON
1
Literal is the word null,
no delimiters
{"key": null}
2
Various supporting
operators and
functions
• IS NULL
• IS NOT NULL
• IFNULL, etc…
3
23. Missing attributes
Supported by JSON
Can’t be explicitly
declared
Similar to undefined in
JavaScript
1
No literal, simply don’t
include an attribute in
an object
{}
2
Various supporting
operators and functions
• IS MISSING
• IS NOT MISSING
• IFMISSING
• IFMISSINGORNULL, etc…
3
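The difference between NULL and MISSING can be illustrated with plain Python dictionaries. This is a sketch under assumptions: the `MISSING` sentinel and the helper names are invented for the example, and returning `None` when no argument qualifies is this sketch's choice rather than a statement about N1QL's exact behavior.

```python
MISSING = object()  # hypothetical stand-in for N1QL's MISSING


def get_attr(doc, name):
    # An attribute that is absent is MISSING; one stored as JSON null is None.
    return doc.get(name, MISSING)


def is_missing(value):
    return value is MISSING


def is_null(value):
    return value is None


def if_missing_or_null(*values):
    # Return the first value that is neither MISSING nor null.
    for value in values:
        if value is not MISSING and value is not None:
            return value
    return None


with_null = {"nickname": None}   # {"nickname": null}
without = {}                     # attribute not present at all

print(is_null(get_attr(with_null, "nickname")))     # attribute exists, value is null
print(is_missing(get_attr(without, "nickname")))    # attribute does not exist
print(if_missing_or_null(get_attr(without, "nickname"), None, "n/a"))
```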
24. Date/times
Not officially supported by
JSON
Can be stored using other data
types
Usually either ISO8601 string
or number of milliseconds
since the Unix epoch
1
Literal depends on the data
type
"2018-04-06T19:26:29.000Z"
1528140389000
2
Various supporting functions
• STR_TO_MILLIS
• CLOCK_MILLIS, CLOCK_STR
• DATE_PART_STR, DATE_PART_MILLIS
• DATE_DIFF_STR, DATE_DIFF_MILLIS
• etc…
3
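The two storage conventions on this slide are interchangeable. The Python sketch below converts between them for the single ISO 8601 layout shown here (UTC with millisecond precision); the function names are hypothetical, and N1QL's own date functions accept more formats than this.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_millis(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS.mmmZ' into milliseconds since the Unix epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    # Integer division of timedeltas avoids floating point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def millis_to_iso(millis):
    """Format milliseconds since the Unix epoch as 'YYYY-MM-DDTHH:MM:SS.mmmZ'."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"


millis = iso_to_millis("2018-04-06T19:26:29.000Z")
print(millis)
print(millis_to_iso(millis))
```

One design point worth noting: ISO 8601 strings in a single time zone sort correctly as plain strings, so either representation can support range predicates.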
27. Joining by Primary Key
SELECT route.sourceairport, route.destinationairport, airline.name
FROM `travel-sample` AS route
INNER JOIN `travel-sample` AS airline
ON route.airlineid = META(airline).id
WHERE route.type = 'route'
ORDER BY route.sourceairport, route.destinationairport, airline.name
29. Joining by Attributes
SELECT route.sourceairport, route.destinationairport, airline.name
FROM `travel-sample` AS route
INNER JOIN `travel-sample` AS airline
ON route.airline = airline.iata AND airline.type = 'airline'
WHERE route.type = 'route'
ORDER BY route.sourceairport, route.destinationairport, airline.name
31. Flattening Embedded Lists
SELECT route.sourceairport, route.destinationairport, schedule.utc
FROM `travel-sample` AS route
UNNEST route.schedule AS schedule
WHERE route.type = 'route' AND schedule.day = 0
ORDER BY route.sourceairport, route.destinationairport, schedule.utc
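What UNNEST produces can be pictured as a nested loop that emits one row per array element. A Python sketch of the query above, with a hypothetical `unnest_schedules` function and made-up route documents:

```python
def unnest_schedules(routes, day):
    rows = []
    for route in routes:
        if route.get("type") != "route":
            continue
        # UNNEST route.schedule AS schedule: one output row per array element,
        # each paired with its parent route.
        for schedule in route.get("schedule", []):
            if schedule.get("day") == day:
                rows.append({
                    "sourceairport": route["sourceairport"],
                    "destinationairport": route["destinationairport"],
                    "utc": schedule["utc"],
                })
    # ORDER BY route.sourceairport, route.destinationairport, schedule.utc
    rows.sort(key=lambda r: (r["sourceairport"], r["destinationairport"], r["utc"]))
    return rows


routes = [
    {"type": "route", "sourceairport": "LAX", "destinationairport": "SFO",
     "schedule": [{"day": 0, "utc": "10:13:00"}, {"day": 1, "utc": "19:14:00"}]},
    {"type": "route", "sourceairport": "ATL", "destinationairport": "JFK",
     "schedule": [{"day": 0, "utc": "08:00:00"}]},
]
rows = unnest_schedules(routes, 0)
print(rows)
```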
32. My data’s not flat, why are my queries?
34. Nesting (a.k.a. LINQ GroupJoin)
SELECT
airport.*,
(SELECT RAW r2.destinationairport FROM routes AS r2) AS destinations
FROM `travel-sample` AS airport
INNER NEST `travel-sample` AS routes
ON airport.faa = routes.sourceairport AND routes.type = 'route'
WHERE airport.type = 'airport'
AND airport.airportname LIKE 'Los Angeles%'
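The shape of a NEST result is easier to see in code: each left-hand document appears once, carrying an array of its matches. A Python sketch of the query above, assuming a hypothetical `nest_routes` function and made-up documents (the `LIKE 'Los Angeles%'` filter is left out for brevity):

```python
def nest_routes(airports, routes):
    # Group routes by source airport once so each airport lookup is cheap.
    by_source = {}
    for route in routes:
        if route.get("type") == "route":
            by_source.setdefault(route["sourceairport"], []).append(route)

    results = []
    for airport in airports:
        if airport.get("type") != "airport":
            continue
        matches = by_source.get(airport["faa"], [])
        if not matches:
            continue  # INNER NEST drops airports that have no matching routes
        result = dict(airport)  # airport.*
        # (SELECT RAW r2.destinationairport FROM routes AS r2) AS destinations
        result["destinations"] = [r["destinationairport"] for r in matches]
        results.append(result)
    return results


airports = [
    {"type": "airport", "faa": "LAX", "airportname": "Los Angeles Intl"},
    {"type": "airport", "faa": "BUR", "airportname": "Bob Hope"},
]
routes = [
    {"type": "route", "sourceairport": "LAX", "destinationairport": "SFO"},
    {"type": "route", "sourceairport": "LAX", "destinationairport": "JFK"},
    {"type": "route", "sourceairport": "ATL", "destinationairport": "JFK"},
]
nested = nest_routes(airports, routes)
print(nested)
```

Unlike a flat join, the airport attributes appear once per airport rather than once per matching route.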
38. Single Attribute Index
CREATE INDEX docsByName
ON bucket (name)
SELECT * FROM bucket
WHERE name LIKE 'A%'
SELECT * FROM bucket
WHERE name >= 'A' AND name < 'N'
39. Multiple Attribute Index
CREATE INDEX docsByNames ON bucket
(lastName, firstName)
SELECT * FROM bucket
WHERE lastName LIKE 'A%'
SELECT * FROM bucket
WHERE lastName = 'Burnett'
AND firstName LIKE 'B%'
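Why the leading attribute matters is easier to see if the index is pictured as a list sorted by (lastName, firstName). The Python sketch below is a simplified model, not Couchbase's index implementation; the function names and documents are made up.

```python
import bisect

docs = {
    "c1": {"lastName": "Adams", "firstName": "Amy"},
    "c2": {"lastName": "Burnett", "firstName": "Brant"},
    "c3": {"lastName": "Burnett", "firstName": "Zed"},
    "c4": {"lastName": "Avery", "firstName": "Bob"},
}

# Index entries are kept sorted by lastName, then firstName.
index = sorted((d["lastName"], d["firstName"], key) for key, d in docs.items())


def scan_last_name_prefix(index, prefix):
    """WHERE lastName LIKE '<prefix>%': seek to the prefix, read until it ends."""
    start = bisect.bisect_left(index, (prefix,))
    keys = []
    for last, _first, key in index[start:]:
        if not last.startswith(prefix):
            break
        keys.append(key)
    return keys


def scan_last_and_first_prefix(index, last_name, first_prefix):
    """WHERE lastName = '<last_name>' AND firstName LIKE '<first_prefix>%'."""
    start = bisect.bisect_left(index, (last_name, first_prefix))
    keys = []
    for last, first, key in index[start:]:
        if last != last_name or not first.startswith(first_prefix):
            break
        keys.append(key)
    return keys


print(scan_last_name_prefix(index, "A"))
print(scan_last_and_first_prefix(index, "Burnett", "B"))
```

A predicate on firstName alone has no contiguous range to seek to in this ordering, which is the intuition behind requiring the first index attribute in the predicate.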
40. Expression Index
CREATE INDEX docsByName ON bucket
(LOWER(lastName), LOWER(firstName))
SELECT * FROM bucket
WHERE LOWER(lastName) LIKE 'a%'
SELECT * FROM bucket
WHERE LOWER(lastName) = 'burnett'
AND LOWER(firstName) LIKE 'b%'
41. Filtered Index
CREATE INDEX custsByName ON bucket
(LOWER(lastName), LOWER(firstName))
WHERE type = 'customer'
SELECT * FROM bucket
WHERE LOWER(lastName) LIKE 'a%'
AND type = 'customer'
SELECT * FROM bucket
WHERE LOWER(lastName) = 'burnett'
AND LOWER(firstName) LIKE 'b%'
AND type = 'customer'
42. Array Index
CREATE INDEX custsByNickName ON bucket
(DISTINCT ARRAY p FOR p IN nickNames END)
WHERE type = 'customer'
SELECT * FROM bucket
WHERE ANY p IN nickNames SATISFIES p = 'Buzz' END
AND type = 'customer'
CREATE INDEX custsByNickName ON bucket
(DISTINCT ARRAY LOWER(p) FOR p IN nickNames END)
WHERE type = 'customer'
SELECT * FROM bucket
WHERE ANY p IN nickNames SATISFIES LOWER(p) = 'buzz' END
AND type = 'customer'
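An array index can be thought of as an inverted index: one entry per distinct array element, pointing back at the document. The Python sketch below models that idea only; `build_array_index` and the sample documents are hypothetical, and the real index structure is not described in this deck.

```python
def build_array_index(bucket, transform=lambda value: value):
    index = {}
    for key, doc in bucket.items():
        # WHERE type = 'customer' on the index definition
        if doc.get("type") != "customer":
            continue
        # DISTINCT ARRAY transform(p) FOR p IN nickNames END
        for value in {transform(p) for p in doc.get("nickNames", [])}:
            index.setdefault(value, set()).add(key)
    return index


bucket = {
    "c1": {"type": "customer", "nickNames": ["Buzz", "B"]},
    "c2": {"type": "customer", "nickNames": ["buzz", "Buzz"]},
    "o1": {"type": "order", "nickNames": ["Buzz"]},  # filtered out of the index
}

exact_index = build_array_index(bucket)
lower_index = build_array_index(bucket, transform=str.lower)

# ANY p IN nickNames SATISFIES p = 'Buzz' END
print(sorted(exact_index.get("Buzz", set())))
# ANY p IN nickNames SATISFIES LOWER(p) = 'buzz' END
print(sorted(lower_index.get("buzz", set())))
```

Every distinct element adds an entry, which is why long arrays can grow the index considerably.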
43. Index Architecture
[Diagram: Data Nodes stream changes over DCP to Index Nodes A and B, which between them hold Index 1 through Index 4 plus replicas of Index 1 and Index 2]
44. Deferring Index Build
CREATE INDEX docsByName
ON bucket (name)
WITH {"defer_build": true}
CREATE INDEX docsByNames
ON bucket (lastName, firstName)
WITH {"defer_build": true}
BUILD INDEX ON bucket
(docsByName, docsByNames)
45. Replicated Index
CREATE INDEX custsByName ON bucket
(LOWER(lastName), LOWER(firstName))
WHERE type = 'customer'
WITH {"num_replica": 1}
SELECT * FROM bucket
WHERE LOWER(lastName) LIKE 'a%'
AND type = 'customer'
SELECT * FROM bucket
WHERE LOWER(lastName) = 'burnett'
AND LOWER(firstName) LIKE 'b%'
AND type = 'customer'
46. Partitioned Index
CREATE INDEX custsByName ON bucket
(LOWER(lastName), LOWER(firstName))
WHERE type = 'customer'
PARTITION BY hash(tenantId)
WITH {"num_replica": 1}
SELECT * FROM bucket
WHERE LOWER(lastName) LIKE 'a%'
AND type = 'customer'
SELECT * FROM bucket
WHERE LOWER(lastName) = 'burnett'
AND tenantId = 123456
AND type = 'customer'
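The difference between the two queries on this slide comes down to how many partitions have to be scanned. The Python sketch below models that routing decision. It is an assumption-heavy illustration: `zlib.crc32` stands in for whatever hash Couchbase actually uses, and the partition count and function names are invented.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(tenant_id):
    # A deterministic hash of the partition key picks the partition.
    # crc32 is a placeholder here, not Couchbase's actual hash function.
    return zlib.crc32(str(tenant_id).encode("utf-8")) % NUM_PARTITIONS


def partitions_to_scan(equality_predicates):
    """Return the partitions a query must visit, given its equality predicates."""
    if "tenantId" in equality_predicates:
        # The hashed attribute is pinned, so only one partition can match.
        return [partition_for(equality_predicates["tenantId"])]
    # Otherwise every partition must be scanned and the results gathered.
    return list(range(NUM_PARTITIONS))


print(partitions_to_scan({"lastName": "burnett"}))
print(partitions_to_scan({"lastName": "burnett", "tenantId": 123456}))
```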
48. Index Selection Criteria
• All predicates on the index must be included in
the query
• The first index expression must be in the
predicate
• Chooses the index with the most matching
expressions
• If more than one option, chooses one at random
for load balancing
• Does not use statistics for optimization (yet…)
49. Query Node
Query Process (a simplified subset)
Data Nodes
Index Node
1. Incoming Query 7. Query Result
2. Query Plan
7. Filter, Sort, Agg, etc
50. Live Demo!
This should be interesting…
51. Nested Loop vs Hash Join in C#
Nested Loop Join
IEnumerable<RouteAirlines> Join(
IList<Route> routes, IList<Airline> airlines)
{
foreach (var route in routes)
{
var routeAirlines = new RouteAirlines
{
Route = route,
Airlines = new List<Airline>()
};
foreach (var airline in airlines)
{
if (airline.Iata == route.Airline) {
routeAirlines.Airlines.Add(airline);
}
}
yield return routeAirlines;
}
}
Hash Join
IEnumerable<RouteAirlines> Join(
IList<Route> routes, IList<Airline> airlines)
{
var hashTable = airlines.ToLookup(p => p.Iata);
foreach (var route in routes)
{
var routeAirlines = new RouteAirlines
{
Route = route,
Airlines = hashTable[route.Airline].ToList()
};
yield return routeAirlines;
}
}
52. N1QL Hash Join
SELECT route.sourceairport, route.destinationairport, airline.name
FROM `travel-sample` AS route
INNER JOIN `travel-sample` AS airline USE HASH(build)
ON route.airline = airline.iata AND airline.type = 'airline'
WHERE route.type = 'route'
ORDER BY route.sourceairport, route.destinationairport, airline.name
53. Key Optimization Takeaways
Make sure fetch
is no larger than
necessary
1
Design covering
indexes where
possible
2
Watch out for
pagination
3
Consider USE
HASH where
applicable
4
Keep joins to a
minimum
5
Scalability – Multi node, auto-sharded architecture makes it easy to scale out horizontally
Availability – Multi node architecture makes high availability easy
COUCH = Cluster of Unreliable Commodity Hardware
Agility – JSON documents without schema enforcement make it easy for teams to iterate quickly
Non-first normal form query language
Think millions of documents
Schema is not enforced by the DB
Let’s see how to represent customer data in JSON.
The primary key (CustomerID) becomes the document key.
Each column name and value becomes a key-value pair.
What if I wanted to filter to airports with an altitude less than 1000?
Just use JavaScript dot notation to access attributes at any depth
You may also use JavaScript square bracket array notation to access items in arrays by index
You can use the LOWER function to avoid case-sensitive collation
String concatenation is one difference from T-SQL, which uses +: N1QL uses double vertical bars (||). Since we can’t know the type in advance, we need a concatenation operator that is separate from the addition operator
Note that array elements don’t necessarily have to be of the same type, though they usually are
First animation: note that we use an alias on the bucket name. This prevents confusion when we’re getting multiple document types from the same bucket.
Second animation: note that we’re using META().id to get the primary key of the document to join
Also, note that this syntax is only available in Couchbase Server 5.5 and later
But what if I want to join based on an attribute instead of the primary key?
The type filter on the second extent should be part of the ON clause, not the WHERE clause
There must be an index to support looking up the second extent based on these clauses
Not as performant as a join based on primary key, which doesn’t need an index at all
Embedding an array inside a document creates an implicit 1:N relationship between the root document and the items in the array. But how do I join across this relationship?
Note that we want to know all the routes for a set of airports. In traditional SQL, we’d have to flatten the output, repeating the airport data for every matching route.
Nesting is analogous to GroupJoin in LINQ, where all matching documents are returned in an array
We’re using an additional subquery on the array in the select projection to reduce the data we’re returning
There is also LEFT OUTER NEST
Indexes every document in the bucket by the primary key
Supports any query, but with poor performance
Kind of like a table scan in SQL, except it scans every document in the entire bucket
Not recommended for production, except in some very specific use cases
Automatically excludes any documents where “name” is MISSING
The attribute must be included in the predicate for the index to be used, just like a SQL index
Automatically excludes any documents where “lastName” or “firstName” is MISSING
At least the first attribute must be included in the predicate for the index to be used, just like a SQL index
The second attribute will be used if possible, and so on for multiple attributes
Can use any deterministic function to adjust attributes before they are indexed
Predicates must use the same function to match the index
Still excludes MISSING lastName and firstName, since LOWER(MISSING) = MISSING
Can include any quantity of deterministic predicates
Requires that queries must include the same predicates (all of them!) in order to match the index
Because query planning occurs before parameter substitution, the type = ‘customer’ clause cannot be parameterized
Any expression in the index definition can be an ARRAY clause, though only one array is allowed
Includes all values in the array, so long arrays can significantly increase index size
Animation: You can also use functions as part of the array clause
DCP streams mutations (inserts, updates, and deletes) to the index nodes
Streaming is async, thus indexes are eventually consistent
High availability and load balancing are provided by placing replicas on more than one node; each replica is a full copy of the index
Any given query only accesses one copy of the index on one node, avoiding scatter/gather for low latency
When building an index, streams the entire bucket from the data nodes to the index node
Only one index build can be running at a time
By building more than one index with a single BUILD command, we can share the stream
Creates two complete copies of the index, on two different index nodes
Provides HA and load balancing
New in 5.5, only available in enterprise edition
Spreads the index across all nodes in the cluster (optionally a subset of nodes), deciding which node receives which part of the index based on a deterministic hash of the referenced attribute
Good for particularly large indexes, as index can now scale horizontally
Creates a scatter/gather situation which can increase latency, but that is eliminated if you include an equality predicate for the hashed attribute so the query can go to just one node
Note that if the index contains all data needed by the query, it will “cover” the query, meaning steps 4 and 5 are skipped
The key to optimizing this process is to reduce waste in step 4: avoid having the index return documents that are then thrown out by step 7
Oversimplification, but delivers the concept
Which one of these do you think is most efficient?
Depends on the relative sizes of the two lists
With a short first list, it isn’t worth the time to build the hash table
A normal attribute join uses a nested loop, which is inefficient if the left-hand extent has lots of data and the right-hand side is a small, repeating set
Hash join is an optimization automatically selected by RDBMS implementations, but must be manually chosen in N1QL
Builds a hash table of all possible matches on the right hand extent, and uses the hash table when processing the left hand extent
Use “probe” instead of “build” to build the hash table on the left side instead of the right (should be the smaller set)
Only available on 5.5 Enterprise Edition (free for dev, but costs for production, includes support)