{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hands-On Session for the tutorial: Multi-model Data query languages and processing paradigms in CIKM 2020\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1: ArangoDB Installation" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "To better follow this hands-on session, please install the ArangoDB in advance. \n", "\n", "You can install the lateset version by following the official instructions if your computer satisfies the requirement of v3.7.0:\n", "\n", "* https://www.arangodb.com/docs/stable/installation.html\n", "\n", "Or you may download and install the previous community builds (e.g., v3.4.0 https://download.arangodb.com/arangodb34/index.html) of ArangoDB \n", "\n", "We will use the ArangoDB WebUI to perform the queries, the default url is *localhost:8529*, and the default username is root with the empty password." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Document store in ArangoDB : collections and documents\n", "\n", "*Relational databases* contain *tables* of *records* (as *rows*).\n", "\n", "An **ArangoDB document database** contains **collections** that contain **documents**. The documents follow the JSON format, and are usually stored in a binary format.\n", "\n", "Below is an example of json document containing information of a student and corresponding scores.\n", "\n", "** Score Document**\n", "```\n", "{\"_id\":0,\"name\":\"aimee Zank\",\n", " \"scores\":[{\"score\":1.463179736705023,\"type\":\"exam\"},\n", " {\"score\":11.78273309957772,\"type\":\"quiz\"},\n", " {\"score\":35.8740349954354,\"type\":\"homework\"}]\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### 2.1 Warm-up: Loading the score documents\n", "\n", "### The score file is kept here. https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/blob/master/scores.json\n", "\n", "Please import the file to the server using arangoimport as follows:\n", "\n", "./arangoimport --file PATH-TO/scores.json --collection scores --create-collection true\n", "\n", "Notes:\n", "\n", "arangoimport options: https://www.arangodb.com/docs/stable/programs-arangoimport-options.html\n", "\n", "I recommend to add the arangoimport path to the system PATH variable:\n", "\n", "Default Path in Mac for arangoimport: /Applications/ArangoDB3-CLI.app/Contents/Resources\n", "\n", "Default Path in Linux for arangoimport: /usr/local/Cellar/arangodb\n", "\n", "Default Path in Windows for arangoimport: C:\\Program Files\\ArangoDB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Querying the documents" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 1: return a score document in the collection. \n", "\n", "For doc in scores Filter doc.name ==\"Leonida Lafond\" return doc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Query 2: (multiple conditions) return a score document in the collection. \n", "For doc in scores Filter doc.name ==\"Leonida Lafond\" and doc._key=='266197464913' return doc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 3: (array operator 1) find types of scores.\n", "For doc in scores limit 1 return doc.scores[*].type" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 4: (array operator 2) find students whose exam scores are greater than 90.\n", "For doc in scores limit 1 return doc.scores[* Filter CURRENT.score>90].score" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 5: (array operator 3) compute the average score.\n", "For doc in scores limit 1 return AVERAGE(doc.scores[*].score)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 6: flatten\n", "Return FLATTEN([ 1, 2, [ 3, 4 ], 5, [ 6, 7 ], [ 8, [ 9, 10 ] ] ])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 7: sorting \n", "For doc in scores\n", " Sort first(doc.scores[*].score) DESC\n", " Return doc" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 8: grouping (with or without count)\n", "For doc in scores\n", " COLLECT name=doc.name into g\n", " return {name,g}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 9: define a variable using Let\n", "FOR doc in scores \n", " LET average_score=AVERAGE(doc.scores[*].score)\n", " SORT average_score DESC \n", " RETURN { name:doc.name,average_score:average_score}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 10: Inner join between two collections\n", " FOR doc1 in scores\n", " FOR doc2 in scores\n", " Filter doc1.name==doc2.name\n", " return {doc1:doc1,doc2:doc2}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Graph store : nodes and edges\n", "\n", "An ArangoDB graph database contains a set of node collections and edge collections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Loading the example graphs" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "arangosh> var examples = require(\"@arangodb/graph-examples/example-graph.js\");\n", "arangosh> var g = examples.loadGraph(\"knows_graph\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Traversing the graphs\n", "\n", "FOR vertex[, edge[, path]]\n", " IN [min[..max]]\n", " OUTBOUND|INBOUND|ANY startVertex\n", " GRAPH graphName\n", " [OPTIONS options]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 11: find the friends of a given person. \n", "\n", "// get a random person p\n", "Let p= (For person in persons Sort rand() limit 1 return person)\n", "\n", "// find the friends of p\n", "FOR v,e,path\n", "IN 1..1 any p[0]._id\n", "GRAPH \"knows_graph\"\n", "RETURN {p,v,e}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 12: Filtering\n", "# Filtering vertex \n", "// get person bob\n", "Let p= (For person in persons Filter person._key=='bob' return person)\n", "\n", "// find the friends of p\n", "FOR v,e\n", "IN 1..1 any p[0]._id\n", "GRAPH \"knows_graph\"\n", "Filter v._key=='alice'\n", "RETURN {p,v,e}\n", "\n", "# Filtering path\n", "// get person bob\n", "Let p= (For person in persons Filter person._key=='bob' return person)\n", "\n", "// find the friends of p\n", "FOR v,e,path\n", "IN 1..2 any p[0]._id\n", "GRAPH \"knows_graph\"\n", "Filter length(path.edges)>1\n", "RETURN {p,v,e,path}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Query 13: Graph functions -- Shortest Path\n", "\n", "// find the friends of p\n", "FOR v,e\n", "IN Any SHORTEST_PATH\n", "'persons/charlie' to 'persons/alice'\n", "GRAPH \"knows_graph\"\n", "RETURN {v,e}" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## 4. Your turn - Querying the multi-model datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Data description\n", "The data consists of three files: KnowsGraph.csv, Order.json, and Person.csv. Specifically, Person is the tabular data with fields(id, firstName, lastname, gender, birthday, creationDate, locationIp, browserUsed). KnowsGraph is the linked data, each row is an edge starting from a PersonId to another PersonId. Order is the json data with fields (OrderId, PersonId, OrderDate,TotalPrice, [Orderline:[productId,title,price ]). Note that Orderline is an array including more than one product.\n", "\n", "Download the data here https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/tree/master/Multi-model-data and import them using arangoimport as follows: \n", "\n", "./arangoimport --file PATH-TO/Multi-model-data/Person.csv --type csv --translate \"id=_key\" --collection \"Person\" --server.username root --create-collection true\n", "\n", "./arangoimport --file PATH-TO/Multi-model-data/Order.json --type json --translate \"id=_key\" --collection \"Order\" --server.username root --create-collection true\n", "\n", "./arangoimport --file \"PATH-TO/Multi-model-data/KnowsGraph.csv\" --type csv --translate \"from=_from\" --translate \"to=_to\" --collection \"KnowsGraph\" --from-collection-prefix Person --to-collection-prefix Person --server.username root --create-collection true --create-collection-type edge" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Hands-on exercises\n", " There are five tasks for querying the multi-model data as follows:\n", " \n", " Q1: Get the top-10 best-selling products in all orders. Hint: use the wildcard [*] to access the array and use the flatten operator to expand the sub-array. Assume the quantity of each product in the orderline is one.\n", " \n", " Q2: Calculate the total cost of female’s orders in year 2008 (involve the Customer table and Order files, OrderDate contains the year information).\n", " \n", " Q3: Given a start person (_key='2199023262543'), return the number of orders made by this person’s friends in 2009. (friends are the outbound vertices of the start person)\n", " \n", " Q4: Given PersonX (_key='2199023259756') and PersonY (_key='26388279077535'), find the shortest path between them, and also return TOP-5 best-selling products for all persons in that path (including PersonX and PersonY). \n", " \n", " Q5: Find the top-2 persons who spend the highest amount of money in JSON orders. Then for each person, traverse her knows-graph with 3-hop to find the friends, and finally return the number of common friends for these two persons. \n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 1 }