noSQL databases: MongoDB

Dario Malchiodi

Université Sophia-Antipolis, May 29, 2018

Organize data exploiting relations (that is, sets of tuples)

Among the most used:

Oracle
MySQL
PostgreSQL
SQLite

noSQL DBMS

Again, a wide variety:

MongoDB
CouchDB
Cassandra
Redis

This post is a good starting point in order to understand how they differ.

Why noSQL?

Because the world is messy!

Data often happen not to follow a predefined scheme
specifically, in the data science domain

Incidentally, noSQL gives us interesting advantages:

dealing with complex structures at once
scaling horizontally (sharding)
robustness through redundancy

Enter MongoDB

MongoDB (http://www.mongodb.com) is a document-oriented open source noSQL DBMS.

Developed since 2007
In production since the 1.4 release (march 2010)
Latest stable release: 3.6.5, released on may 2018
(the name refers to «humongous»)

Enter MongoDB

MongoDB has, however, some critical points:

indices are organized using B-trees requiring a lot of RAM
databases tend to take more disk space than their omologous on a RDBMS
no transactions support is provided

A db agregates tables
A table aggregates rows
A row aggregates values for (fixed) attributes
Tables can be joined
A schema fixes how rows in a table are structured

A db agregates tables collections
A table collection aggregates rows documents
A row aggregates values for (fixed) attributes (variable) fields encoded as (name, value) pairs
Tables cannot be joined (at least, not directly)
A schema fixes how rows in a table are not structured

The row of a relational database can be encoded as a sequence of fixed typed values
This is not possible in MongoDB, according to the lack of a prefixed structure in documents
This is why documents are described using the JSON format
JSON (JavaScript Object Notation, http://www.json.org) is a lightweight data-interchange format, easy to read (by humans) and to parse (by computers)

Lightweightness

Simply put:

['one', 'two', 'three']

is much more simple than

<array>
  <element>one</element>
  <element>two</element>
  <element>three</element>
</array>

JSON: Backus–Naur form

An example

First Name	Last Name	License plate
Donald	Duck	313
Mickey	Mouse	113
Goofy		114

becomes

{'first-name': 'Donald', 'last-name': 'Duck',
 'license-plate': 313},
{'first-name': 'Mickey', 'last-name': 'Mouse',
 'license-plate': 113},
{'first-name': 'Goofy', 'license-plate': 114}

Internal representation

MongoDB actually stores data using the BSON format (Binary JSON, http://bsonspec.org/), designed in order to:

efficiently serialize JSON documents
easily traverse such documents
keep lightweightness

Embedded documents

Documents can be nested:

{'first-name': 'Donald',
 'last-name': 'Duck',
 'car': {'model': 'American Bantam',
         'license-plate': 313}
}

Arrays

Arrays are first-class objects in BSON:

{'first-name': 'Donald',
 'last-name': 'Duck',
 'car': {'model': 'American Bantam',
         'license-plate': 313},
 'relatives': ['Huey', 'Dewey', 'Louie']
}

MongoDB components

`mongod`	server (listening on port 27017)
mongo	javascript client
API	for Python, Java, Scala, Ruby, C, C++, ... (http://api.mongodb.org)
installation	https://docs.mongodb.com/manual/installation

$ mongod --dbpath .
2018-05-28T08:47:04.813+0200 I CONTROL  [initandlisten] MongoDB starting : pid=5172 port=27017 dbpath=. ...
...

$ mongo
MongoDB shell version: 3.6.5
...
>

Luckily, you shouldn't need to do that for the interactive part. Ready?

We need data for the interactive part

Let's see who's who in Duckburg

Preliminary operations

import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
duckburg_db = client.duckburg # silently creates db
client.database_names()       # it's empty, thus only default dbs will show up

['admin', 'config', 'local']

ducks = duckburg_db.ducks       # silently creates collection within db
duckburg_db.collection_names()  # it will not show up, either

[]

Insertion of documents in a collection

The easiest way to put one document in a db is through the insert_one method

r = ducks.insert_one({'first_name': 'Donald',
                     'last_name': 'Duck',
                     'gender': 'M',
                     'car': {'model': 'American Bantam',
                             'license_plate': 313},
                     'birth_year': 1920,
                     'first_appearance': 1934})

r

<pymongo.results.InsertOneResult at 0x7f53d8cbe6c0>

The insert_one method returns an object gathering information about the result

acknowledged is a boolean field which becomes True once the write operation has been acknowledged (there's more here, but we will not consider it)

r.acknowledged

True

inserted_id is a field containing an univocal identificator for the newly created document (we'll see this more in depth in a moment)

r.inserted_id

ObjectId('5b0c602be45bba0cb44ca461')

What happens after this first write operation?

print(client.database_names())
print(duckburg_db.collection_names())

['admin', 'config', 'duckburg', 'local']
['ducks']

A couple of technical details

maximum document size: 16MB
insertion also as an upsert (will see that later)

Let's populate the db

characters = [
    {'first_name': 'Grandma', 'last_name': 'Duck', 'gender': 'F',
     'car': {'model': 'Detroit Electric'},
     'birth_year': 1833, 'first_appearance': 1943,
     'hobbies': ['cooking', 'gardening']},
    {'first_name': 'Scrooge', 'last_name': 'McDuck', 'gender': 'M',
     'birth_year': 1867, 'first_appearance': 1947,
     'hobbies': ['finance', 'savings', 'swimming in money bins']},
    {'first_name': 'Gyro', 'last_name': 'Gearloose', 'gender': 'M',
     'first_appearance': 1952, 'hobbies': ['invention', 'study']},
    {'first_name': 'Ludwig', 'last_name': 'Von Drake', 'gender': 'M',
     'first_appearance': 1961, 'hobbies': ['study']}
]

r = ducks.insert_many(characters)
r

<pymongo.results.InsertManyResult at 0x7f53d8cd3870>

r.inserted_ids

[ObjectId('5b0c604de45bba0cb44ca462'),
 ObjectId('5b0c604de45bba0cb44ca463'),
 ObjectId('5b0c604de45bba0cb44ca464'),
 ObjectId('5b0c604de45bba0cb44ca465')]

Reading a collection I

duck = ducks.find_one()
duck

{'_id': ObjectId('5b0c602be45bba0cb44ca461'),
 'birth_year': 1920,
 'car': {'license_plate': 313, 'model': 'American Bantam'},
 'first_appearance': 1934,
 'first_name': 'Donald',
 'gender': 'M',
 'last_name': 'Duck'}

There is no guarantee about which document will be retrieved!

Note that JSON-encoded documents can be accessed using the dictionary syntax

duck['first_name']

'Donald'

Wait: _id? ObjectId?

ducks.find_one()['_id']

ObjectId('5b0c602be45bba0cb44ca461')

The `_id` field is automatically added to all documents; it contains a unique identifier to be used for indexing purposes. By default MongoDB sets it to an `ObjectId` instance

from bson.objectid import ObjectId
[ObjectId(), ObjectId(), ObjectId()]

[ObjectId('5b0c6054e45bba0cb44ca466'),
 ObjectId('5b0c6054e45bba0cb44ca467'),
 ObjectId('5b0c6054e45bba0cb44ca468')]

Reading a collection II

Selection of several documents matching a query is done through the find method

r = ducks.find()
r

<pymongo.cursor.Cursor at 0x7f53d8cbf080>

We will speak of cursors in a while, for the moment we just cast them to lists in order to see which documents are matched

list(ducks.find())

[{'_id': ObjectId('5b0c602be45bba0cb44ca461'),
  'birth_year': 1920,
  'car': {'license_plate': 313, 'model': 'American Bantam'},
  'first_appearance': 1934,
  'first_name': 'Donald',
  'gender': 'M',
  'last_name': 'Duck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca462'),
  'birth_year': 1833,
  'car': {'model': 'Detroit Electric'},
  'first_appearance': 1943,
  'first_name': 'Grandma',
  'gender': 'F',
  'hobbies': ['cooking', 'gardening'],
  'last_name': 'Duck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca463'),
  'birth_year': 1867,
  'first_appearance': 1947,
  'first_name': 'Scrooge',
  'gender': 'M',
  'hobbies': ['finance', 'savings', 'swimming in money bins'],
  'last_name': 'McDuck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca464'),
  'first_appearance': 1952,
  'first_name': 'Gyro',
  'gender': 'M',
  'hobbies': ['invention', 'study'],
  'last_name': 'Gearloose'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca465'),
  'first_appearance': 1961,
  'first_name': 'Ludwig',
  'gender': 'M',
  'hobbies': ['study'],
  'last_name': 'Von Drake'}]

A more compact view of documents

Let's build a function providing a more compact visualization of ducks

def show_duck(d):
    try:
        return '{0} {1}'.format(d['first_name'], d['last_name'])
    except KeyError:
        return d['first_name']
    
show_duck(duck)

'Donald Duck'

Reading a collection: simple selectors I

Based on the existence of specific fields

[show_duck(d) for d in ducks.find({'car': {'$exists': True}})]

['Donald Duck', 'Grandma Duck']

Reading a collection: simple selectors II

Selecting specific values

[show_duck(d) for d in ducks.find({'gender': 'F'})]

['Grandma Duck']

[show_duck(d) for d in ducks.find({'first_appearance': 1961})]

['Ludwig Von Drake']

Reading a collection: simple selectors III

On the basis of a relation

[show_duck(d) for d in ducks.find({'first_appearance': {'$gt': 1950}})]

['Gyro Gearloose', 'Ludwig Von Drake']

Filtering more than one value for a field

[show_duck(d) for d in ducks.find({'first_appearance': {'$in': [1947, 1961]}})]

['Scrooge McDuck', 'Ludwig Von Drake']

Reading a collection: complex selectors I

Logical conjunction

[show_duck(d) for d in ducks.find({'first_appearance': {'$lt': 1950}, 
                                   'gender': 'M'})]

['Donald Duck', 'Scrooge McDuck']

Logical disjunction

[show_duck(d) for d in ducks.find({'$or': [{'birth_year': {'$gt': 1900}}, 
                                           {'gender': 'F'}]})]

['Donald Duck', 'Grandma Duck']

Reading a collection: complex selectors II

Array contents

[show_duck(d) for d in ducks.find({'hobbies': 'study'})]

['Gyro Gearloose', 'Ludwig Von Drake']

[show_duck(d) for d in ducks.find({'hobbies': ['study']})]

['Ludwig Von Drake']

[show_duck(d) for d in ducks.find({'hobbies.1': 'study'})]

['Gyro Gearloose']

Reading a collection: complex selectors II

Embedded documents

[show_duck(d) for d in ducks.find({'car.model': 'American Bantam'})]

['Donald Duck']

Reading collections: complex selectors III

Regular expressions

import re
regx = re.compile("uck$", re.IGNORECASE)

[show_duck(d) for d in ducks.find({'last_name': regx})]

['Donald Duck', 'Grandma Duck', 'Scrooge McDuck']

Beware of flexibility

Using regular expressions, as well as ` $lt` or `$ gt`, may lead to inefficiency in query execution.

Projections I

Implemented via a second argument to `find`

either specifying fields to be shown...

list(ducks.find({'hobbies.1': 'study'},
                {'first_name': 1, 'last_name': 1}))

[{'_id': ObjectId('5b0c604de45bba0cb44ca464'),
  'first_name': 'Gyro',
  'last_name': 'Gearloose'}]

...or fields to be excluded.

list(ducks.find({'hobbies.1': 'study'},
                {'_id': 0, 'gender': 0, 'birth_year': 0,
                 'first_appearance': 0, 'hobbies': 0}))

[{'first_name': 'Gyro', 'last_name': 'Gearloose'}]

The select/exclude options are mutually exclusive.
Only exception: the `_id` field can be filtered out:

list(ducks.find({'hobbies.1': 'study'},
                {'first_name': 1, 'last_name': 1, '_id': 0}))

[{'first_name': 'Gyro', 'last_name': 'Gearloose'}]

Indicization I

Collections can be indexed specifying a set of fields and the corresponding sort order (the syntax used by is in this case different from the original API)

ducks.create_index([('first_name', pymongo.ASCENDING),
                    ('first_appearance', pymongo.DESCENDING)])

'first_name_1_first_appearance_-1'

Indicization II

If a query is covered by an index, that is

all filtered fields,
and all displayed fields

are within the index, the query is executed via index traversal.

regx = re.compile("^G.*", re.IGNORECASE)
list(ducks.find({'first_name': regx, 'first_appearance': {'$lt': 1955}},
                {'_id': 0, 'first_name': 1, 'first_appearance': 1}))

[{'first_appearance': 1943, 'first_name': 'Grandma'},
 {'first_appearance': 1952, 'first_name': 'Gyro'}]

Cursors

find returns a cursor to the query resultset, accessible via

invocation of the next method (until StopIteration)

cursor = ducks.find()
try:
    while True:
        print(show_duck(cursor.next()))
except StopIteration:
    pass

Donald Duck
Grandma Duck
Scrooge McDuck
Gyro Gearloose
Ludwig Von Drake

positional access through __getitem__ (aka [])

cursor = ducks.find()
show_duck(cursor[1])

'Grandma Duck'

show_duck(cursor[4])

'Ludwig Von Drake'

(even going back)

show_duck(cursor[2])

'Scrooge McDuck'

list conversion, for-based iteration,

cursor = ducks.find()

for d in cursor:
    print(show_duck(d))

Donald Duck
Grandma Duck
Scrooge McDuck
Gyro Gearloose
Ludwig Von Drake

invocation of the skip, limit, sort, and count methods

cursor = ducks.find().sort('last_name', pymongo.DESCENDING)
for d in cursor:
    print(show_duck(d))

Ludwig Von Drake
Scrooge McDuck
Gyro Gearloose
Donald Duck
Grandma Duck

cursor.rewind()

<pymongo.cursor.Cursor at 0x7f53d8cea0b8>

The update method is wicked!

ducks.insert_one({'first_name': 'Magica', 'last_name': 'De Spell',
                  'first_appearance': 1961})
ducks.update({'first_name': 'Magica'}, {'first_appearance': 2000})

/home/malchiodi/apps/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: update is deprecated. Use replace_one, update_one or update_many instead.
  This is separate from the ipykernel package so we can avoid doing imports until

{'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

Note a couple of things:

the used method is deprecated;
it returns a dictionary (instead of an object) summarizing the effects of the operation

Let's double-check:

ducks.find_one({'first_name': 'Magica'})

ducks.find_one({'first_appearance': 2000})

{'_id': ObjectId('5b0c61c3e45bba0cb44ca469'), 'first_appearance': 2000}

This (deprecated) form of update completely replaced the document!

Solution

Use the update_one and update_many methods

# OK, let's revert things
ducks.find_one_and_delete({'first_appearance': 2000})
ducks.insert_one({'first_name': 'Magica', 'last_name': 'De Spell',
                  'first_appearance': 1961})

ducks.update_one({'first_name': 'Magica'}, {'first_appearance': 2000})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-6a0ad72fdd9c> in <module>()
      4                   'first_appearance': 1961})
      5 
----> 6 ducks.update_one({'first_name': 'Magica'}, {'first_appearance': 2000})

~/apps/anaconda3/lib/python3.6/site-packages/pymongo/collection.py in update_one(self, filter, update, upsert, bypass_document_validation, collation)
    887         """
    888         common.validate_is_mapping("filter", filter)
--> 889         common.validate_ok_for_update(update)
    890         with self._socket_for_writes() as sock_info:
    891             result = self._update(sock_info, filter, update, upsert,

~/apps/anaconda3/lib/python3.6/site-packages/pymongo/common.py in validate_ok_for_update(update)
    437     first = next(iter(update))
    438     if not first.startswith('$'):
--> 439         raise ValueError('update only works with $ operators')
    440 
    441 

ValueError: update only works with $ operators

WTF(irst_appearance)?

This error is due to the fact that these methods don't allow to implicitly replace a whole document
Instead, the use of $set, which actually updates one or more fields leaving the rest of the document untouched, is enforced

r = ducks.update_one({'first_name': 'Magica'}, {'$set': {'first_appearance': 2000}})
r

<pymongo.results.UpdateResult at 0x7f53d8cd3c60>

The operation returns an object gathering information about the result:

# n. of matched documents
print(r.matched_count)

1

# n. of modified documents
r.modified_count

1

# None if no upsertion, id of created document otherwise
# this deals with upsertions, will be explained later on
r.upserted_id

# results in the update-style
r.raw_result

{'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

We can check that the method behaves as expected

ducks.find_one({'first_name': 'Magica'})

{'_id': ObjectId('5b0b3d92e45bba1692b3bf4d'),
 'first_appearance': 2000,
 'first_name': 'Magica',
 'last_name': 'De Spell'}

`update_one` only updates one of the identified documents, `update_many` updates all of them.

There is also the possibility of removing fields in a document, using the `$unset` operator.

ducks.update_one({'first_name': 'Magica'}, {'$unset': {'first_appearance': ''}})

<pymongo.results.UpdateResult at 0x7f53d8ced240>

ducks.find_one({'first_name': 'Magica'})

{'_id': ObjectId('5b0c61cce45bba0cb44ca46a'),
 'first_name': 'Magica',
 'last_name': 'De Spell'}

` $inc` and `$ dec` respectively increment and decrement a numeric quantity
`$push` adds elements to arrays

ducks.update_one({'first_name': 'Gyro'}, {'$push': {'hobbies': 'lamps'}})
ducks.find_one({'first_name': 'Gyro'})

{'_id': ObjectId('5b0c604de45bba0cb44ca464'),
 'first_appearance': 1952,
 'first_name': 'Gyro',
 'gender': 'M',
 'hobbies': ['invention', 'study', 'lamps'],
 'last_name': 'Gearloose'}

`$pop` extracts the last or first element in an array
`$pull` removes all instances of an element from an array

When dealing with arrays, `$` is a special operator denoting the index of the first occurrence of a value matched in an update query

ducks.update_one({'hobbies': 'lamps'}, {'$set': {'hobbies.$': 'robotic lamps'}})
ducks.find_one({'first_name': 'Gyro'})['hobbies']

['invention', 'study', 'robotic lamps']

Note that `$` can be used also in order to dig down to hierarchies. Suppose a document contains the following value for `hobbies`:

[{'name': 'invention', 'day': 'Monday'},
 {'name': 'study', 'day': 'Friday'}]

The update query modifies the day for the study hobby:

{{'hobbies.name': 'study'}, {'$set': {'hobbies.$.day': 'Tuesday'}}}

What happens if one attempts to update a non-existing document?

nothing...
...unless the `upsert=True` additional argument has been specified in `update_one` or `update_many`: in this case the operation results in an insert-upon-update (upsert).

at most one document identified by a query

ducks.delete_one({'first_name': 'Ludwig'})

all documents identified by a query

ducks.delete_many({'first_name': 'Ludwig'})

all documents in a collection

ducks.delete_many()

a collection

ducks.drop()

a database

client.drop_database('duckburg')

# or

duckburg_db.command('dropDatabase')

Scalability in MongoDB is achieved through sharding, that is distributing and replicating data across several machines
This however doesn't allow to perform joins automatically

`ObjectId`s can be used to manually recreate one-to-many joins...

donald_id = ducks.find_one({'first_name': 'Donald'})['_id']
for d in ['Huey', 'Dewey', 'Louie']:
    ducks.insert_one({'first_name': d, 'uncle': donald_id})

...at the price of an additional query (to recover, in this case, the uncle).

Many-to-many joins can be implemented

using arrays

ducks.insert_one({'first_name': 'Donald', 'last_name': 'Duck',
                  'relatives': [ObjectId('516405b8b356cd6125b74e89'),
                                ObjectId('516405b8b356cd6125b74e8a'),
                                ObjectId('516405b8b356cd6125b74e8b')]})
ducks.find({'relatives': ObjectId('516405b8b356cd6125b74e8a')})

or embedded documents

ducks.insert_one({'first_name': 'Donald', 'last_name': 'Duck',
                  'relatives': [{'first_name': 'Huey'},
                                {'first_name': 'Dewey'},
                                {'first_name': 'Louie'}]})
ducks.find_one({'relatives.first_name': 'Dewey'})

a possible solution (duplication not to be necessarily demonized)
critical limit: the maximal size of each document is 16 MBytes
(aside note: the whole text of Shakespeare's Hamlet requires 200 Kbytes)
the actual price to be paid for this is in having a higher number of queries (for instance, when duplicate data are updated)

There's no universal solution: the choice depends on the problem under study, as well as on personal design style

Sharding
Map-Reduce
Aggregation framework
...

Download the notebook and dataset in a computer having mongodb and jupyter installed
Unzip the dataset and open the notebook
Solve the proposed challenges adding cells just below each question
Save your results in a notebook named using your name followed by "-mongodb"

The following cells contain some challenges: solve them and send the results to malchiodi@di.unimi.it, specifying Sophia Antipolis lecture assignment 1 as subject.

You can consult with classmates but what you send should be the exclusive result of your work, and in any case do not copy code.

Import the dataset¶

We will work with a book catalog already encoded in JSON format and available at the URL https://github.com/ozlerhakan/mongodb-json-files/blob/master/datasets/catalog.books.json

You don't need to download this dataset, as it is already in the Docker image we are using. You need however to import it into MongoDB: the next cell takes care of this task, creating a books collection within a db having the same name.

%%bash
mongoimport --db books --collection books --file catalog.books.json

2018-05-28T22:19:45.687+0200	connected to: localhost
2018-05-28T22:19:45.716+0200	imported 431 documents

Inspect data¶

Although we saw that a tabular description might not be the best resource to be used when visualizing noSQL data, we can build and show a Pandas dataframe as a quick way to peek into the dataset.

db = client.books
books = db.books

import pandas as pd
pd.DataFrame(list(books.find().limit(5)))

How many books are there?
Which is the highest number of co-authors in a book?
How many distinct authors are there?
Which author has the highest number of coauthors?
How many coauthors has this author and who are these coauthors?
Deduplicate author
Remove grouped authors
Write (but don't execute!) a query removing books with null authors
Which class is used in order to store time stamps?
How many books were published in 2009?
Which are the least and the most recent publication years?
Display a bar plot showing the number of all published books between 1992 and 2015
Build a function get_coauthors_for_years taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is None the search should be done on all books)
Use the function get_coauthors_for_years in order to draw the collaboration graph of all books published between 1997 and 1999 (Hint: use the nextworkx.Graph class, specifically its add_node and add_edge methods, as well as the networkx.draw method in order to show a graph)
Build and save the collaboration matrix of all authors
Build and save the collaboration matrix of all books
Build and save additional information

How many books are there?¶

Which is the highest number of co-authors in a book?¶

How many distinct authors are there?¶

Which author has the highest number of co-authors?¶

Note: empty string do not count as authors

How many coauthors has this author?¶

Who are these coauthors?¶

Deduplicate author¶

It seems our data are not that clean: one author is actually the empty string, Kimberly Tripp occurs twice (one time with the middle name, another one without it), and in two cases the author's name is actually the name for a group of authors. Let's clean up these things. First of all we can deduplicate records referring to Kimberly Tripp: write an update query modifying all occurrences of 'Kimberly L. Tripp' into 'Kimberly Tripp'

Remove grouped authors¶

Write an update query removing all occurrences of 'Contributions from 53 SQL Server MVPs' and Contributions from 53 SQL Server MVPs; Edited by Paul Nielsen as author names, replacing them with the empty string.

Write (but don't execute!) a query removing books with missing author¶

Write a query which would remove all documents having the empty string as author. Don't execute the query, as it will remove interesting information we will need later on.

Which class is used in order to store time stamps?¶

How many books were published in 2009?¶

Which are the least and the most recent publication years?¶

Display a bar plot showing the number of all published books between 1992 and 2015¶

Build a function `get_coauthors_for_years` taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is `None` the search should be done on all books)¶

Use the function `get_coauthors_for_years` in order to draw the collaboration graph of all books published between 1997 and 1999¶

The collaboration graph here is intended as a graph having authors as nodes and connecting two nodes if the corresponding authors have coauthored at least one book.

Hint: use the nextworkx.Graph class, specifically its add_node and add_edge methods, as well as the networkx.draw method in order to show a graph.

Build and save the collaboration matrix of all authors¶

Consider all books regardless of publication year, and build the authors' collaboration matrix, that is a square matrix whose dimension equals the number of authors. Entry $(r, c)$ in this matrix (where $r$ and $c$ refer respectively to rows and columns) should be 0 if $r$ -th and $c$ -th author never coauthored a book, otherwise it should be set to $\frac{1}{n_c}$ , where $n_c$ is the total number of coauthors of $c$ -th author.

We'll be interested actually only in non-null entries: save them as one row of a CSV file using the format $(r, c, m_{rc})$ , being $m_{rc}$ the non-null element in row $r$ and column $c$ . Use the name author-connections.csv for this file.

Hint: if you get confused about what to place in rows and columns, take in mind that the resulting matrix should be columns-wise stochastic, that is the sum of all elements in any columns has to be equal to 1.

Build and save the collaboration matrix of all books¶

The collaboration matrix of books is obtained in the same way as those of authors, where now two books are connected (and thus identify a non-null entry in the matrix) if they share at least one author. Also in this case, the obtained matrix should be column-wise stochastic.

Build and save additional information¶

Write in a CSV file the association of each book to the corresponding indices of rows and columns in the previously produced book connection matrix: each row of the file will have the format $(i,t)$ , where $i$ is the index and $t$ is the title. Name this file 'book-index.csv'

Write a file 'book-categories.csv' in which each line has the format $(i, j)$ , where $i$ is an index of the book (in the previously produced matrix and vector) and $j$ is 1 if the corresponding book has the Java category in the dataset, and 0 otherwise.

	_id	authors	categories	isbn	longDescription	pageCount	publishedDate	shortDescription	status	thumbnailUrl	title
0	2	[W. Frank Ableson, Robi Sen]	[Java]	1935182722	When it comes to mobile apps, Android can do a...	592	2011-01-14	Android in Action, Second Edition is a compreh...	PUBLISH	https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....	Android in Action, Second Edition
1	3	[Gojko Adzic]	[Software Engineering]	1617290084	NaN	0	2011-06-03	NaN	PUBLISH	https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....	Specification by Example
2	1	[W. Frank Ableson, Charlie Collins, Robi Sen]	[Open Source, Mobile]	1933988673	Android is an open source mobile phone platfor...	416	2009-04-01	Unlocking Android: A Developer's Guide provide...	PUBLISH	https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....	Unlocking Android
3	4	[Tariq Ahmed with Jon Hirschi, Faisal Abid]	[Internet]	1933988746	New web applications require engaging user-fri...	576	2009-02-02	NaN	PUBLISH	https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....	Flex 3 in Action
4	6	[Satnam Alag]	[Internet]	1933988312	There's a great deal of wisdom in a crowd, but...	425	2008-10-01	NaN	PUBLISH	https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....	Collective Intelligence in Action