The document discusses several plugins and UDFs for Apache Drill, including:
1. A blockchain format plugin that reads blockchain data and maps inputs/outputs, with plans to support currency conversions and address calculation.
2. Networking UDFs to work with IP addresses like conversion, checking private IPs, and CIDR functions.
3. Phonetic and fuzzy matching UDFs, using phonetic algorithms like Soundex and string distance measures like Levenshtein distance.
4. An XML storage plugin in development to allow reading XML data and optionally flattening attributes and structures.
2. Blockchain Format Plugin
• No dependencies
• In theory, should work with Bitcoin, Litecoin, Ethereum and other blockchain formats
• Currently reads fields from bitcoin blockchain
• Still very much a work in progress
3. Blockchain Format Plugin
• Reads Block, Block headers, Transactions, Transaction Headers
• Maps input and output transactions to maps
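To make "reads block headers" concrete, here is a minimal sketch of decoding the standard 80-byte Bitcoin block header into a map of named fields. It is not the plugin's code. The function names `parse_block_header` and `block_hash` are hypothetical, and the field names in the returned map are assumptions chosen for illustration.

```python
import hashlib
import struct

# Layout of the 80-byte Bitcoin block header (all integers little-endian):
# version (int32), previous block hash (32 bytes), Merkle root (32 bytes),
# timestamp (uint32), difficulty bits (uint32), nonce (uint32).
HEADER_STRUCT = struct.Struct("<i32s32sIII")


def parse_block_header(raw: bytes) -> dict:
    """Decode one 80-byte header into a map of named fields (hypothetical names)."""
    if len(raw) != HEADER_STRUCT.size:
        raise ValueError(f"expected {HEADER_STRUCT.size} bytes, got {len(raw)}")
    version, prev_hash, merkle_root, timestamp, bits, nonce = HEADER_STRUCT.unpack(raw)
    return {
        "version": version,
        # Hashes are stored byte-reversed relative to their usual hex display.
        "prev_block": prev_hash[::-1].hex(),
        "merkle_root": merkle_root[::-1].hex(),
        "timestamp": timestamp,
        "bits": bits,
        "nonce": nonce,
    }


def block_hash(raw: bytes) -> str:
    """Double SHA-256 of the header, shown in the conventional reversed hex form."""
    return hashlib.sha256(hashlib.sha256(raw).digest()).digest()[::-1].hex()
```

A format plugin would produce one such map per block and then continue with the transaction list that follows the header. That part is omitted here.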
4. Blockchain Format Plugin – Future Functionality
• Calculate BTC addresses
• UDFs to convert BTC to currencies via API call
• Flatten input and output transactions (multiple transactions still need to be fixed)
• Convert code style to follow pattern of the PCAP plugin
• Testing, testing, testing…
5. Networking UDFs
• inet_aton(<IPv4>): This function converts an IPv4 address in dotted decimal notation into an integer. This is
useful for sorting IP addresses, and reducing the amount of space that they take on disk.
• inet_ntoa(<int>): This function returns an IP in dotted decimal notation given its integer representation.
• is_private_ip(<IPv4>): Returns true if the IP address is private.
• in_network( <IPv4>, <CIDR Block>): Returns true if the IPv4 address is in the CIDR Block
• getAddressCount( <CIDR Block> ): Returns the number of IP addresses in a given CIDR Block
• getBroadcastAddress( <CIDR Block> ): Returns the broadcast address in dotted decimal notation from a
given CIDR block.
• getNetmask( <CIDR Block> ): Returns the netmask for a given CIDR Block
• getLowAddress( <CIDR Block> ): Returns the first IPv4 address in dotted decimal notation for a given CIDR
Block
• getHighAddress( <CIDR Block> ): Returns the last IPv4 address in dotted decimal notation for a given CIDR
Block
• urldecode( <URL> ): Decodes a URL-encoded argument
• urlencode( <URL> ): Returns a URL encoded version of the argument
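The semantics of the IP and CIDR functions above can be illustrated with a short Python sketch built on the standard `ipaddress` module. This is not the UDF source. The Python names mirror the UDF names only for readability, `describe_cidr` is a hypothetical helper, "private" is assumed to mean the RFC 1918 ranges, and the sketch counts every address in a block, including the network and broadcast addresses, which the UDFs may treat differently.

```python
import ipaddress

# Assumption: "private" means the three RFC 1918 ranges.
RFC1918_NETWORKS = [
    ipaddress.IPv4Network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def inet_aton(ip: str) -> int:
    """Dotted decimal IPv4 address -> integer."""
    return int(ipaddress.IPv4Address(ip))


def inet_ntoa(value: int) -> str:
    """Integer -> dotted decimal IPv4 address."""
    return str(ipaddress.IPv4Address(value))


def is_private_ip(ip: str) -> bool:
    """True if the address falls in one of the RFC 1918 ranges."""
    address = ipaddress.IPv4Address(ip)
    return any(address in network for network in RFC1918_NETWORKS)


def in_network(ip: str, cidr: str) -> bool:
    """True if the address lies inside the CIDR block."""
    return ipaddress.IPv4Address(ip) in ipaddress.IPv4Network(cidr, strict=False)


def describe_cidr(cidr: str) -> dict:
    """Hypothetical helper gathering the per-block values in one map."""
    network = ipaddress.IPv4Network(cidr, strict=False)
    return {
        "address_count": network.num_addresses,  # includes network and broadcast
        "netmask": str(network.netmask),
        "broadcast_address": str(network.broadcast_address),
        "low_address": str(network[0]),
        "high_address": str(network[-1]),
    }
```

Storing `inet_aton` output instead of the dotted string means a numeric sort gives the correct address order, whereas a lexical sort of the strings would place 10.0.0.2 after 10.0.0.10.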
7. Phonetic & Fuzzy Matching UDFs
• Series of phonetic and fuzzy matching algorithms including:
• Soundex
• Metaphone
• DoubleMetaphone
• Sounds_like()
• https://github.com/cgivre/drill-phonetic-functions
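As an illustration of what a phonetic code looks like, here is a minimal sketch of American Soundex in Python. It is written from the commonly described rules, not taken from the UDF source, so edge cases may differ from the Drill functions. `same_soundex` is a hypothetical helper showing how such codes support a "sounds like" comparison.

```python
# Consonant groups used by American Soundex; vowels, H, W and Y have no code.
SOUNDEX_CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or '' if the word has no ASCII letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not break a run of identical codes
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def same_soundex(a: str, b: str) -> bool:
    """Hypothetical helper: two words 'sound alike' if their codes match."""
    return soundex(a) == soundex(b)
```

In a query, this kind of function is typically used in a join or filter condition to match names that are spelled differently but pronounced similarly.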
8. Phonetic & Fuzzy Matching UDFs
• Series of string distance functions
• Levenshtein Distance
• Jaro Distance
• Hamming Distance
• Longest Common Substring Distance and others
• https://github.com/cgivre/drill-phonetic-functions
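Two of the distance measures listed above can be sketched in a few lines of Python. These are textbook formulations for illustration and are not taken from the UDF source. The function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))
```

Levenshtein distance tolerates insertions and deletions, so it suits free-text fields. Hamming distance only applies to values of equal length, such as fixed-width codes.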
9. XML Storage Plugin
• Magnus Pierre started working on this
• Uses XML stream reader
• Allows user to specify “data level” such that any data at a lower
nesting level is ignored.
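The "data level" idea can be sketched with a streaming parser in Python. This is one possible reading and not the plugin's implementation. It assumes the root element is depth 1, that "lower nesting level" means the wrapper elements above the chosen depth, and that each element at the data level becomes one record. `read_records` is a hypothetical name.

```python
import io
import xml.etree.ElementTree as ET


def read_records(xml_text: str, data_level: int):
    """Yield one dict per element at depth `data_level` (root element is depth 1).

    Assumption: elements above the data level are treated as wrappers and ignored.
    """
    depth = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On an "end" event, `depth` is still the depth of the closing element.
        if depth == data_level:
            yield {child.tag: (child.text or "").strip() for child in element}
            element.clear()  # release the finished record's children
        depth -= 1
```

With a document shaped like library > shelf > book, a data level of 3 would return one record per book and skip the library and shelf wrappers.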
10. XML Storage Plugin
• Allows optional flattening of attributes
• Allows optional flattening of nested structures
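One way to picture the two flattening options is a recursive walk that turns attributes and nested elements into flat columns. The sketch below is an assumption about the general approach, not the plugin's code. The `parent_child` column naming scheme and the `flatten_element` function are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_element(element, prefix="", flatten_attributes=True):
    """Flatten one element into a single-level dict.

    Assumptions: columns are named parent_child, attributes become
    element_attribute columns, and repeated sibling tags overwrite one
    another (lists are not handled in this sketch).
    """
    row = {}
    name = prefix + element.tag
    if flatten_attributes:
        for key, value in element.attrib.items():
            row[f"{name}_{key}"] = value
    children = list(element)
    if not children:
        row[name] = (element.text or "").strip()
    for child in children:
        row.update(flatten_element(child, name + "_", flatten_attributes))
    return row
```

For example, flattening a `book` element with an `id` attribute and a nested `author` element containing `name` would produce columns such as `book_id` and `book_author_name`.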
What doesn’t work yet:
• Nested data structures (i.e., maps or lists within maps or lists)