hands-on.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hands-On Session: Multi-model Data query languages and processing paradigms in CIKM 2020\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Multi-model queries in ArangoDB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1: ArangoDB Installation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "To get started, please download and install the previous community builds (e.g., v3.4.0 https://download.arangodb.com/arangodb34/index.html) of ArangoDB \n",
    "\n",
    "Or you can install the lateset version by following the official instructions if your computer satisfies the requirement of v3.7.0:\n",
    "\n",
    "* https://www.arangodb.com/docs/stable/installation.html\n",
    "\n",
    "and started the arangodb daemon with the following command.\n",
    "\n",
    "> arangod\n",
    "\n",
    "We recommend to use the ArangoDB WebUI to perform the queries, the default url is *localhost:8529*, default username is root with the empty password."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Document store : collections and documents\n",
    "\n",
    "*Relational databases* contain *tables* of *records* (as *rows*).\n",
    "\n",
    "An **ArangoDB document database** contains **collections** that contain **documents**. The documents follow the JSON format, and are usually stored in a binary format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src = \"http://json.org/object.gif\">\n",
    "<img src = \"http://json.org/array.gif\">\n",
    "<img src = \"http://json.org/value.gif\">\n",
    "\n",
    "Below is an example of json document containing information of a student and corresponding scores.\n",
    "\n",
    "** Score Document**\n",
    "```\n",
    "{\"_id\":0,\"name\":\"aimee Zank\",\n",
    " \"scores\":[{\"score\":1.463179736705023,\"type\":\"exam\"},\n",
    "           {\"score\":11.78273309957772,\"type\":\"quiz\"},\n",
    "           {\"score\":35.8740349954354,\"type\":\"homework\"}]\n",
    "}\n",
    "```"
   ]
  },
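  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following sketch shows how the attributes of such a document can be addressed in AQL. It binds a literal copy of the score document (without the `_id` attribute) to a variable, so it does not depend on any collection. The result attribute names `firstType` and `numScores` are made up for illustration.\n",
    "\n",
    "```aql\n",
    "// Literal copy of the score document (without _id), bound to a variable.\n",
    "LET doc = {\n",
    "  name: \"aimee Zank\",\n",
    "  scores: [\n",
    "    { score: 1.463179736705023, type: \"exam\" },\n",
    "    { score: 11.78273309957772, type: \"quiz\" },\n",
    "    { score: 35.8740349954354, type: \"homework\" }\n",
    "  ]\n",
    "}\n",
    "RETURN {\n",
    "  name: doc.name,                 // attribute access with dot notation\n",
    "  firstType: doc.scores[0].type,  // array index, then attribute access\n",
    "  numScores: LENGTH(doc.scores)   // number of elements in the array\n",
    "}\n",
    "```\n",
    "\n",
    "Run in the web UI query editor, this should return a single object with the name `aimee Zank`, the type `exam`, and a count of 3."
   ]
  },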
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 1.1 Loading the score documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# create a database in arangosh shell\n",
    "arangosh> db._createDatabase(\"handson\");\n",
    "arangosh> db._useDatabase(\"handson\");\n",
    "\n",
    "# import an example dataset in bash\n",
    "arangoimp --file scores.json --collection scores --create-collection true --server.database handson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Arango Query Language (AQL) on documents\n",
    "\n",
    "Basically, AQL return the results by using the following operations:\n",
    "\n",
    "  **FOR**: array iteration\n",
    "  \n",
    "  **RETURN**: results projection\n",
    "  \n",
    "  **FILTER**: results filtering\n",
    "  \n",
    "  **SORT**: result sorting\n",
    "  \n",
    "  **LIMIT**: result slicing\n",
    "  \n",
    "  **LET**: variable assignment\n",
    "  \n",
    "  **COLLECT**: result grouping\n",
    "  \n",
    "  **INSERT**: insertion of new documents\n",
    "  \n",
    "  **UPDATE**: (partial) update of existing documents\n",
    "  \n",
    "  **REPLACE**: replacement of existing documents\n",
    "  \n",
    "  **REMOVE**: removal of existing documents\n",
    "  \n",
    "  **UPSERT**: insertion or update of existing documents"
   ]
  },
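  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of how these operations combine, the sketch below runs on an inline array rather than a collection, so it can be tried before any data is loaded. The values and the variable name `doubled` are arbitrary.\n",
    "\n",
    "```aql\n",
    "// Iterate over an inline array instead of a collection.\n",
    "FOR x IN [5, 3, 8, 1, 9, 2]\n",
    "  FILTER x > 2          // keeps 5, 3, 8, 9\n",
    "  SORT x DESC           // orders them as 9, 8, 5, 3\n",
    "  LIMIT 3               // keeps the first three: 9, 8, 5\n",
    "  LET doubled = x * 2   // derives a value per element\n",
    "  RETURN { x, doubled }\n",
    "```\n",
    "\n",
    "The query should return the three largest values above 2, that is 9, 8, and 5, each together with its doubled value."
   ]
  },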
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create a document:\n",
    "INSERT {\n",
    "    \"_key\":\"211\",\n",
    "    \"name\": \"Chao\",\n",
    "    \"surname\": \"Zhang\",\n",
    "    \"score\": [60,80,90]\n",
    "} INTO scores\n",
    "\n",
    "# Retrieve a document:\n",
    "Return document(\"scores\",\"211\")\n",
    "\n",
    "# Update a document:\n",
    "UPDATE \"211\" WITH { score: [90,90,90] } IN scores\n",
    "\n",
    "# Delete a document:\n",
    "REMOVE { _key: \"211\" } IN scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 1: return a score document in the collection. \n",
    "\n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Query 2: (multiple conditions) return a score document in the collection. \n",
    "For doc in scores Filter doc.name ==\"Leonida Lafond\" and doc._key=='266197464913' return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 3: (array operator 1) find types of scores.\n",
    "For doc in scores limit 1 return doc.scores[*].type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 4: (array operator 2) find students whose exam scores are greater than 90.\n",
    "For doc in scores limit 1 return doc.scores[* Filter CURRENT.score>90].score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 5: (array operator 3) compute the average score.\n",
    "For doc in scores limit 1 return AVERAGE(doc.scores[*].score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 6: flatten\n",
    "Return FLATTEN([ 1, 2, [ 3, 4 ], 5, [ 6, 7 ], [ 8, [ 9, 10 ] ] ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 7: sorting \n",
    "For doc in scores\n",
    "    Sort first(doc.scores[*].score) DESC\n",
    "    Return doc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 8: grouping (with or without count)\n",
    "For doc in scores\n",
    "    COLLECT name=doc.name into g\n",
    "    return {name,g}"
   ]
  },
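  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Grouping with a count can be sketched with `COLLECT ... WITH COUNT INTO`, which keeps only the number of group members instead of the members themselves. The inline records and the names `Ann`, `Bob`, and `numScores` are illustrative assumptions, not part of the scores dataset.\n",
    "\n",
    "```aql\n",
    "// Count the records per name; the inline array stands in for a collection.\n",
    "FOR s IN [\n",
    "  { name: \"Ann\", type: \"exam\" },\n",
    "  { name: \"Ann\", type: \"quiz\" },\n",
    "  { name: \"Bob\", type: \"exam\" }\n",
    "]\n",
    "  COLLECT name = s.name WITH COUNT INTO numScores\n",
    "  RETURN { name, numScores }\n",
    "```\n",
    "\n",
    "This should report a count of 2 for Ann and 1 for Bob. Replacing the inline array with the `scores` collection should give the number of documents per student name."
   ]
  },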
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 9: define a variable using Let\n",
    "FOR doc in scores \n",
    "    LET average_score=AVERAGE(doc.scores[*].score)\n",
    "    SORT average_score DESC \n",
    "    RETURN { name:doc.name,average_score:average_score}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 10: Inner join between two collections\n",
    "    FOR doc1 in collection1\n",
    "        FOR doc2 in collection2\n",
    "            Filter doc1.id==doc2.id\n",
    "            return {doc1:doc1,doc2:doc2}"
   ]
  },
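  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `collection1` and `collection2` are not created anywhere in this session, the same join pattern can be tried on inline arrays. This is a minimal sketch; the arrays `students` and `grades` and their contents are invented for illustration.\n",
    "\n",
    "```aql\n",
    "// Two inline arrays stand in for the two collections.\n",
    "LET students = [ { id: 1, name: \"Ann\" }, { id: 2, name: \"Bob\" } ]\n",
    "LET grades   = [ { id: 1, grade: \"A\" }, { id: 3, grade: \"C\" } ]\n",
    "FOR s IN students\n",
    "  FOR g IN grades\n",
    "    FILTER s.id == g.id   // join condition\n",
    "    RETURN { name: s.name, grade: g.grade }\n",
    "```\n",
    "\n",
    "Only the pair with matching `id` values (Ann with grade A) should be returned. Bob and the grade with `id` 3 have no partner and are dropped, which is what makes this an inner join."
   ]
  },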
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Graph store : nodes and edges\n",
    "\n",
    "An ArangoDB graph database contains a set of node collections and edge collections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Loading the example graphs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "arangosh> var examples = require(\"@arangodb/graph-examples/example-graph.js\");\n",
    "arangosh> var g = examples.loadGraph(\"knows_graph\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Traversing the graphs\n",
    "\n",
    "FOR vertex[, edge[, path]]\n",
    "  IN [min[..max]]\n",
    "  OUTBOUND|INBOUND|ANY startVertex\n",
    "  GRAPH graphName\n",
    "  [OPTIONS options]"
   ]
  },
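  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a concrete instance of this syntax, the sketch below starts at the vertex `persons/alice` of the example graph `knows_graph` and follows outgoing edges for one or two hops. It assumes the example graph has been loaded as in section 2.1. The `bfs` option requests breadth-first order and is the option name used in the ArangoDB versions mentioned above (3.4 and 3.7); later versions may prefer a different spelling.\n",
    "\n",
    "```aql\n",
    "// v: reached vertex, e: last edge, p: full path from the start vertex\n",
    "FOR v, e, p IN 1..2 OUTBOUND \"persons/alice\" GRAPH \"knows_graph\"\n",
    "  OPTIONS { bfs: true }\n",
    "  RETURN { person: v._key, hops: LENGTH(p.edges) }\n",
    "```\n",
    "\n",
    "In the example graph, alice knows bob, and bob knows charlie and dave, so the query should return bob at one hop and charlie and dave at two hops."
   ]
  },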
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 11: find the friends of a given person. \n",
    "\n",
    "// get a random person p\n",
    "Let p= (For person in persons Sort rand() limit 1 return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {p,v,e}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 12: Filtering\n",
    "# Filtering vertex \n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN  1..1 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter v._key=='alice'\n",
    "RETURN {p,v,e}\n",
    "\n",
    "# Filtering path\n",
    "// get person bob\n",
    "Let p= (For person in persons Filter person._key=='bob' return person)\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e,path\n",
    "IN  1..2 any p[0]._id\n",
    "GRAPH \"knows_graph\"\n",
    "Filter length(path.edges)>1\n",
    "RETURN {p,v,e,path}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Query 13: Graph functions -- Shortest Path\n",
    "\n",
    "// find the friends of p\n",
    "FOR v,e\n",
    "IN Any SHORTEST_PATH\n",
    "'persons/charlie' to 'persons/alice'\n",
    "GRAPH \"knows_graph\"\n",
    "RETURN {v,e}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Visualization "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 3. Your turn - exploring the movie datasets\n",
    "\n",
    "Download the IMDB dataset in the Dump and import them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# import the IMDB dataset\n",
    "arangorestore dump --server.database handson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions\n",
    "(1) How many unique types of vertices and unique labels of edges are there in two collections respectively? HINT:  UNIQUE function\n",
    "\n",
    "(2) Some documents in collection imdb_vertices are associated with a \"releaseDate\" field. What is the newest movie in the collection? HINT: MAX function\n",
    "\n",
    "(3) Update a edge between \"imdb_vertices/crime\" and \"imdb_vertices/5541\" in collection imdb_edges with a label \"has_movie\", if the edge isn't exist, create one and insert it into the edge collection. HINT: keyword: UPSERT\n",
    "\n",
    "(4) For documents in collection imdb_vertices, find the ids that don't include any number, save them with a label  into a new collection named \"genre\". (HINTs: use regex expression SUBSTRING(doc._id,14)=~ \"[a-zA-Z]\", create the genre collections beforehand)\n",
    "\n",
    "(5) Find actors whose name include \"David\", return documents that have the \"birthplace\" attribute. HINT: keyword like and HAS function\n",
    "\n",
    "(6) Find the actor who have acted in the most number of movies. HINT: keyword COLLECT\n",
    "\n",
    "(7) Regarding different movie genres, find the Top-5 genres with most number of movies in all time. HINT: keyword COLLECT\n",
    "\n",
    "(8) Return the number of persons who are both actor and director. HINT: SELF-JOIN ON imdb_edges\n",
    "\n",
    "(9) Given a movie \"Forrest Gump\", check its all associated actors. return their real names and role names. HINT: graph traversal\n",
    "\n",
    "(10) Given a actor \"Tom Hanks\", find the directors who have cooperated with him more than twice. HINT: graph traversal\n",
    "HINT: graph traversal and COLLECT.\n",
    "\n",
    "(11) Think about a movie or actor you are interested in, visualize it in the ArangoDB and present some insights from the visulization."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}