hands-on.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hands-On Session for the tutorial: Multi-model Data query languages and processing paradigms in CIKM 2020\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1: ArangoDB Installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "To better follow this hands-on session, please install the ArangoDB in advance. \n",
    "\n",
    "Please download and install the previous community builds (e.g., v3.4.0 https://download.arangodb.com/arangodb34/index.html) of ArangoDB \n",
    "\n",
    "Or you can install the lateset version by following the official instructions if your computer satisfies the requirement of v3.7.0:\n",
    "\n",
    "* https://www.arangodb.com/docs/stable/installation.html\n",
    "\n",
    "We recommend to use the ArangoDB WebUI to perform the queries, the default url is *localhost:8529*, and the default username is root with the empty password."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 Document store in ArangoDB : collections and documents\n",
    "\n",
    "*Relational databases* contain *tables* of *records* (as *rows*).\n",
    "\n",
    "An **ArangoDB document database** contains **collections** that contain **documents**. The documents follow the JSON format, and are usually stored in a binary format.\n",
    "\n",
    "Below is an example of json document containing information of a student and corresponding scores.\n",
    "\n",
    "** Score Document**\n",
    "```\n",
    "{\"_id\":0,\"name\":\"aimee Zank\",\n",
    " \"scores\":[{\"score\":1.463179736705023,\"type\":\"exam\"},\n",
    "           {\"score\":11.78273309957772,\"type\":\"quiz\"},\n",
    "           {\"score\":35.8740349954354,\"type\":\"homework\"}]\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 2.1  Warm-up: Loading the score documents\n",
    "\n",
    "### The score files is kept here. https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/blob/master/scores.json\n",
    "\n",
    "### You may create a collection named scores in the webUI and upload the file manully\n",
    "\n",
    "### or import the file to the server using arangoimp as follows:\n",
    "arangoimp --file scores.json --collection scores --create-collection true"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Arango Query Language (AQL) basics on documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Basically, AQL includes the following operations:\n",
    "\n",
    "  FOR: array iteration\n",
    "  \n",
    "  RETURN: results projection\n",
    "  \n",
    "  FILTER: results filtering\n",
    "  \n",
    "  SORT: result sorting\n",
    "  \n",
    "  LIMIT: result slicing\n",
    "  \n",
    "  LET: variable assignment\n",
    "  \n",
    "  COLLECT: result grouping\n",
    "  \n",
    "  INSERT: insertion of new documents\n",
    "  \n",
    "  UPDATE: (partial) update of existing documents\n",
    "  \n",
    "  REPLACE: replacement of existing documents\n",
    "  \n",
    "  REMOVE: removal of existing documents\n",
    "  \n",
    "  UPSERT: insertion or update of existing documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Querying the documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 1: return a score document in the collection. \n",
    "\n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Query 2: (multiple conditions) return a score document in the collection. \n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" and doc._key=='266197464913' return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 3: (array operator 1) find types of scores.\n",
    "For doc in scores limit 1 return doc.scores[*].type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 4: (array operator 2) find students whose exam scores are greater than 90.\n",
    "For doc in scores limit 1 return doc.scores[* Filter CURRENT.score>90].score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 5: (array operator 3) compute the average score.\n",
    "For doc in scores limit 1 return AVERAGE(doc.scores[*].score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 6: flatten\n",
    "Return FLATTEN([ 1, 2, [ 3, 4 ], 5, [ 6, 7 ], [ 8, [ 9, 10 ] ] ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 7: sorting \n",
    "For doc in scores\n",
    "    Sort first(doc.scores[*].score) DESC\n",
    "    Return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 8: grouping (with or without count)\n",
    "For doc in scores\n",
    "    COLLECT name=doc.name into g\n",
    "    return {name,g}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 9: define a variable using Let\n",
    "FOR doc in scores \n",
    "    LET average_score=AVERAGE(doc.scores[*].score)\n",
    "    SORT average_score DESC \n",
    "    RETURN { name:doc.name,average_score:average_score}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 10: Inner join between two collections\n",
    "    FOR doc1 in collection1\n",
    "        FOR doc2 in collection2\n",
    "            Filter doc1.id==doc2.id\n",
    "            return {doc1:doc1,doc2:doc2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Graph store : nodes and edges\n",
    "\n",
    "An ArangoDB graph database contains a set of node collections and edge collections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Loading the example graphs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "arangosh> var examples = require(\"@arangodb/graph-examples/example-graph.js\");\n",
    "arangosh> var g = examples.loadGraph(\"knows_graph\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Traversing the graphs\n",
    "\n",
    "FOR vertex[, edge[, path]]\n",
    "  IN [min[..max]]\n",
    "  OUTBOUND|INBOUND|ANY startVertex\n",
    "  GRAPH graphName\n",
    "  [OPTIONS options]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 11: find the friends of a given person. \n",
    "\n",
    "// get a random person p\n",
    "Let p= (For person in persons Sort rand() limit 1 return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {p,v,e}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 12: Filtering\n",
    "# Filtering vertex \n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter v._key=='alice'\n",
    "RETURN {p,v,e}\n",
    "\n",
    "# Filtering path\n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..2 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter length(path.edges)>1\n",
    "RETURN {p,v,e,path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 13: Graph functions -- Shortest Path\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN Any SHORTEST_PATH\n",
    "'persons/charlie' to 'persons/alice'\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {v,e}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 4. Your turn - Querying the multi-model datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Data description\n",
    "The data consists of three files: KnowsGraph.csv, Order.json, and Person.csv.  Specifically, Person is the tabular data with fields(id, firstName, lastname, gender, birthday, creationDate, locationIp, browserUsed). KnowsGraph is the linked data, each row is an edge starting from a PersonId to another PersonId. Order is the json data with fields (OrderId, PersonId, OrderDate,TotalPrice, [Orderline:[productId,title,price ]). Note that Orderline is an array including more than one product.\n",
    "\n",
    "Download the data here https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/tree/master/Multi-model-data and import them as follows:\n",
    "\n",
    "For the linux user, run the script:  ./import.sh. For the windows users, run the script import.bat. The scripts are under the repository https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/tree/master/."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Hands-on experience\n",
    "  There are five tasks for querying the multi-model data as follows:\n",
    "  \n",
    "  Q1: Get the top-10 best-selling products in all orders. Hint: use the wildcard [*] to access the array and use the flatten operator to expand the sub-array.  Assume the quantity of each product in the orderline is one.\n",
    "  \n",
    "  Q2: Calculate the total cost of female’s orders in year 2008 (involve the Customer table and Order files).\n",
    "  \n",
    "  Q3: Given a start person (_key='2199023262543'), return the number of orders made by this person’s friends in 2009.\n",
    "  \n",
    "  Q4: Given PersonX (_key='2199023259756'),    and PersonY (_key='26388279077535'),   find the shortest path between them, and also return TOP-5 best-selling products for all persons in that path (including PersonX and PersonY). \n",
    "  \n",
    "  Q5: Find the top-2 persons who spend the highest amount of money in JSON orders. Then for each person, traverse her knows-graph with 3-hop to find the friends, and finally return the number of common friends for these two persons. \n",
    "\n",
    "  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}