Skip to content
Snippets Groups Projects
hands-on.ipynb 12 KiB
Newer Older
Chao Zhang's avatar
Chao Zhang committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "# Hands-On Session for the tutorial: Multi-model Data query languages and processing paradigms in CIKM 2020\n"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "## 1: ArangoDB Installation"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "To better follow this hands-on session, please install the ArangoDB in advance. \n",
    "\n",
    "Please download and install the previous community builds (e.g., v3.4.0 https://download.arangodb.com/arangodb34/index.html) of ArangoDB \n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "Or you can install the lateset version by following the official instructions if your computer satisfies the requirement of v3.7.0:\n",
    "\n",
    "* https://www.arangodb.com/docs/stable/installation.html\n",
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "We recommend to use the ArangoDB WebUI to perform the queries, the default url is *localhost:8529*, and the default username is root with the empty password."
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "## 2 Document store in ArangoDB : collections and documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "*Relational databases* contain *tables* of *records* (as *rows*).\n",
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "An **ArangoDB document database** contains **collections** that contain **documents**. The documents follow the JSON format, and are usually stored in a binary format.\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "Below is an example of json document containing information of a student and corresponding scores.\n",
    "\n",
    "** Score Document**\n",
    "```\n",
    "{\"_id\":0,\"name\":\"aimee Zank\",\n",
    " \"scores\":[{\"score\":1.463179736705023,\"type\":\"exam\"},\n",
    "           {\"score\":11.78273309957772,\"type\":\"quiz\"},\n",
    "           {\"score\":35.8740349954354,\"type\":\"homework\"}]\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 2.1  Warm-up: Loading the score documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "### The score files is kept here. https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/blob/master/scores.json\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "### You may create a collection named scores in the webUI and upload the file manully\n",
    "\n",
    "### or import the file to the server using arangoimp as follows:\n",
    "arangoimp --file scores.json --collection scores --create-collection true"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 2.2 Arango Query Language (AQL) basics on documents"
   ]
  },
  {
Chao Zhang's avatar
Chao Zhang committed
   "cell_type": "markdown",
Chao Zhang's avatar
Chao Zhang committed
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Basically, AQL includes the following operations:\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "  FOR: array iteration\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  RETURN: results projection\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  FILTER: results filtering\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  SORT: result sorting\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  LIMIT: result slicing\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  LET: variable assignment\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  COLLECT: result grouping\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  INSERT: insertion of new documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  UPDATE: (partial) update of existing documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  REPLACE: replacement of existing documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  REMOVE: removal of existing documents\n",
Chao Zhang's avatar
Chao Zhang committed
    "  \n",
Chao Zhang's avatar
Chao Zhang committed
    "  UPSERT: insertion or update of existing documents"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
Chao Zhang's avatar
Chao Zhang committed
   "cell_type": "markdown",
   "metadata": {},
Chao Zhang's avatar
Chao Zhang committed
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 2.3 Querying the documents"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 1: return a score document in the collection. \n",
    "\n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Query 2: (multiple conditions) return a score document in the collection. \n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" and doc._key=='266197464913' return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 3: (array operator 1) find types of scores.\n",
    "For doc in scores limit 1 return doc.scores[*].type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 4: (array operator 2) find students whose exam scores are greater than 90.\n",
    "For doc in scores limit 1 return doc.scores[* Filter CURRENT.score>90].score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 5: (array operator 3) compute the average score.\n",
    "For doc in scores limit 1 return AVERAGE(doc.scores[*].score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 6: flatten\n",
    "Return FLATTEN([ 1, 2, [ 3, 4 ], 5, [ 6, 7 ], [ 8, [ 9, 10 ] ] ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 7: sorting \n",
    "For doc in scores\n",
    "    Sort first(doc.scores[*].score) DESC\n",
    "    Return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 8: grouping (with or without count)\n",
    "For doc in scores\n",
    "    COLLECT name=doc.name into g\n",
    "    return {name,g}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 9: define a variable using Let\n",
    "FOR doc in scores \n",
    "    LET average_score=AVERAGE(doc.scores[*].score)\n",
    "    SORT average_score DESC \n",
    "    RETURN { name:doc.name,average_score:average_score}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 10: Inner join between two collections\n",
    "    FOR doc1 in collection1\n",
    "        FOR doc2 in collection2\n",
    "            Filter doc1.id==doc2.id\n",
    "            return {doc1:doc1,doc2:doc2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "## 3. Graph store : nodes and edges\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "An ArangoDB graph database contains a set of node collections and edge collections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 3.1 Loading the example graphs"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
Chao Zhang's avatar
Chao Zhang committed
   "cell_type": "markdown",
Chao Zhang's avatar
Chao Zhang committed
   "metadata": {
Chao Zhang's avatar
Chao Zhang committed
    "collapsed": true
Chao Zhang's avatar
Chao Zhang committed
   },
   "source": [
    "arangosh> var examples = require(\"@arangodb/graph-examples/example-graph.js\");\n",
    "arangosh> var g = examples.loadGraph(\"knows_graph\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 3.2 Traversing the graphs\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
    "FOR vertex[, edge[, path]]\n",
    "  IN [min[..max]]\n",
    "  OUTBOUND|INBOUND|ANY startVertex\n",
    "  GRAPH graphName\n",
    "  [OPTIONS options]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 11: find the friends of a given person. \n",
    "\n",
    "// get a random person p\n",
    "Let p= (For person in persons Sort rand() limit 1 return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {p,v,e}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 12: Filtering\n",
    "# Filtering vertex \n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter v._key=='alice'\n",
    "RETURN {p,v,e}\n",
    "\n",
    "# Filtering path\n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..2 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter length(path.edges)>1\n",
    "RETURN {p,v,e,path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 13: Graph functions -- Shortest Path\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN Any SHORTEST_PATH\n",
    "'persons/charlie' to 'persons/alice'\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {v,e}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "## 4. Your turn - Querying the multi-model datasets"
Chao Zhang's avatar
Chao Zhang committed
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Chao Zhang's avatar
Chao Zhang committed
    "### 4.1 Data description\n",
    "The data consists of three files: KnowsGraph.csv, Order.json, and Person.csv.  Specifically, Person is the tabular data with fields(id, firstName, lastname, gender, birthday, creationDate, locationIp, browserUsed). KnowsGraph is the linked data, each row is an edge starting from a PersonId to another PersonId. Order is the json data with fields (OrderId, PersonId, OrderDate,TotalPrice, [Orderline:[productId,title,price ]). Note that Orderline is an array including more than one product.\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "Download the data here https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/tree/master/Multi-model-data and import them as follows:\n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "For the linux user, run the script:  ./import.sh. For the windows users, run the script import.bat. The scripts are under the repository https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/tree/master/."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Hands-on experience\n",
    "  There are five tasks for querying the multi-model data as follows:\n",
    "  \n",
    "  Q1: Get the top-10 best-selling products in all orders. Hint: use the wildcard [*] to access the array and use the flatten operator to expand the sub-array.  Assume the quantity of each product in the orderline is one.\n",
    "  \n",
    "  Q2: Calculate the total cost of female’s orders in year 2008 (involve the Customer table and Order files).\n",
    "  \n",
    "  Q3: Given a start person (_key='2199023262543'), return the number of orders made by this person’s friends in 2009.\n",
    "  \n",
    "  Q4: Given PersonX (_key='2199023259756'),    and PersonY (_key='26388279077535'),   find the shortest path between them, and also return TOP-5 best-selling products for all persons in that path (including PersonX and PersonY). \n",
    "  \n",
    "  Q5: Find the top-2 persons who spend the highest amount of money in JSON orders. Then for each person, traverse her knows-graph with 3-hop to find the friends, and finally return the number of common friends for these two persons. \n",
Chao Zhang's avatar
Chao Zhang committed
    "\n",
Chao Zhang's avatar
Chao Zhang committed
    "  "
Chao Zhang's avatar
Chao Zhang committed
   ]
Chao Zhang's avatar
Chao Zhang committed
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
Chao Zhang's avatar
Chao Zhang committed
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}