noSQL databases: MongoDB
Dario Malchiodi
Université Sophia-Antipolis, May 29, 2018
RDBMS

Organize data exploiting relations (that is, sets of tuples)

Among the most used:

  • Oracle
  • MySQL
  • PostgreSQL
  • SQLite
noSQL DBMS

Again, a wide variety:

  • MongoDB
  • CouchDB
  • Cassandra
  • Redis

This post is a good starting point in order to understand how they differ.

Why noSQL?

Because the world is messy!

  • Data often happen not to follow a predefined scheme
  • specifically, in the data science domain

Incidentally, noSQL gives us interesting advantages:

  • dealing with complex structures at once
  • scaling horizontally (sharding)
  • robustness through redundancy
Enter MongoDB

MongoDB (http://www.mongodb.com) is a document-oriented open source noSQL DBMS.

  • Developed since 2007
  • In production since the 1.4 release (march 2010)
  • Latest stable release: 3.6.5, released on may 2018
  • (the name refers to «humongous»)
Enter MongoDB

MongoDB has, however, some critical points:

  • indices are organized using B-trees requiring a lot of RAM
  • databases tend to take more disk space than their omologous on a RDBMS
  • no transactions support is provided
In a classical DBMS
  • A db agregates tables
  • A table aggregates rows
  • A row aggregates values for (fixed) attributes
  • Tables can be joined
  • A schema fixes how rows in a table are structured
In a classical document oriented DBMS
  • A db agregates tables collections
  • A table collection aggregates rows documents
  • A row aggregates values for (fixed) attributes (variable) fields encoded as (name, value) pairs
  • Tables cannot be joined (at least, not directly)
  • A schema fixes how rows in a table are not structured
Representation of data

  • The row of a relational database can be encoded as a sequence of fixed typed values
  • This is not possible in MongoDB, according to the lack of a prefixed structure in documents
  • This is why documents are described using the JSON format
  • JSON (JavaScript Object Notation, http://www.json.org) is a lightweight data-interchange format, easy to read (by humans) and to parse (by computers)
Lightweightness

Simply put:

['one', 'two', 'three']

is much more simple than

<array>
  <element>one</element>
  <element>two</element>
  <element>three</element>
</array>
JSON: Backus–Naur form

An example
First Name Last Name License plate
Donald Duck 313
Mickey Mouse 113
Goofy 114

becomes

{'first-name': 'Donald', 'last-name': 'Duck',
 'license-plate': 313},
{'first-name': 'Mickey', 'last-name': 'Mouse',
 'license-plate': 113},
{'first-name': 'Goofy', 'license-plate': 114}
Internal representation

MongoDB actually stores data using the BSON format (Binary JSON, http://bsonspec.org/), designed in order to:

  • efficiently serialize JSON documents
  • easily traverse such documents
  • keep lightweightness
Embedded documents

Documents can be nested:

{'first-name': 'Donald',
 'last-name': 'Duck',
 'car': {'model': 'American Bantam',
         'license-plate': 313}
}
Arrays

Arrays are first-class objects in BSON:

{'first-name': 'Donald',
 'last-name': 'Duck',
 'car': {'model': 'American Bantam',
         'license-plate': 313},
 'relatives': ['Huey', 'Dewey', 'Louie']
}
MongoDB components
`mongod`server (listening on port 27017)
mongojavascript client
APIfor Python, Java, Scala, Ruby, C, C++, ... (http://api.mongodb.org)
installationhttps://docs.mongodb.com/manual/installation
Manual execution
$ mongod --dbpath .
2018-05-28T08:47:04.813+0200 I CONTROL  [initandlisten] MongoDB starting : pid=5172 port=27017 dbpath=. ...
...

$ mongo
MongoDB shell version: 3.6.5
...
>

Luckily, you shouldn't need to do that for the interactive part. Ready?

We need data for the interactive part

Let's see who's who in Duckburg

Preliminary operations
In [3]:
import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017/')
duckburg_db = client.duckburg # silently creates db
client.database_names()       # it's empty, thus only default dbs will show up
Out[3]:
['admin', 'config', 'local']
In [4]:
ducks = duckburg_db.ducks       # silently creates collection within db
duckburg_db.collection_names()  # it will not show up, either
Out[4]:
[]
Insertion of documents in a collection

The easiest way to put one document in a db is through the insert_one method

In [5]:
r = ducks.insert_one({'first_name': 'Donald',
                     'last_name': 'Duck',
                     'gender': 'M',
                     'car': {'model': 'American Bantam',
                             'license_plate': 313},
                     'birth_year': 1920,
                     'first_appearance': 1934})
In [6]:
r
Out[6]:
<pymongo.results.InsertOneResult at 0x7f53d8cbe6c0>

The insert_one method returns an object gathering information about the result

  • acknowledged is a boolean field which becomes True once the write operation has been acknowledged (there's more here, but we will not consider it)
In [7]:
r.acknowledged
Out[7]:
True
  • inserted_id is a field containing an univocal identificator for the newly created document (we'll see this more in depth in a moment)
In [8]:
r.inserted_id
Out[8]:
ObjectId('5b0c602be45bba0cb44ca461')
What happens after this first write operation?
In [9]:
print(client.database_names())
print(duckburg_db.collection_names())
['admin', 'config', 'duckburg', 'local']
['ducks']
A couple of technical details
  • maximum document size: 16MB
  • insertion also as an upsert (will see that later)
Let's populate the db
In [10]:
characters = [
    {'first_name': 'Grandma', 'last_name': 'Duck', 'gender': 'F',
     'car': {'model': 'Detroit Electric'},
     'birth_year': 1833, 'first_appearance': 1943,
     'hobbies': ['cooking', 'gardening']},
    {'first_name': 'Scrooge', 'last_name': 'McDuck', 'gender': 'M',
     'birth_year': 1867, 'first_appearance': 1947,
     'hobbies': ['finance', 'savings', 'swimming in money bins']},
    {'first_name': 'Gyro', 'last_name': 'Gearloose', 'gender': 'M',
     'first_appearance': 1952, 'hobbies': ['invention', 'study']},
    {'first_name': 'Ludwig', 'last_name': 'Von Drake', 'gender': 'M',
     'first_appearance': 1961, 'hobbies': ['study']}
]

r = ducks.insert_many(characters)
r
Out[10]:
<pymongo.results.InsertManyResult at 0x7f53d8cd3870>
In [11]:
r.inserted_ids
Out[11]:
[ObjectId('5b0c604de45bba0cb44ca462'),
 ObjectId('5b0c604de45bba0cb44ca463'),
 ObjectId('5b0c604de45bba0cb44ca464'),
 ObjectId('5b0c604de45bba0cb44ca465')]
Reading a collection I
In [12]:
duck = ducks.find_one()
duck
Out[12]:
{'_id': ObjectId('5b0c602be45bba0cb44ca461'),
 'birth_year': 1920,
 'car': {'license_plate': 313, 'model': 'American Bantam'},
 'first_appearance': 1934,
 'first_name': 'Donald',
 'gender': 'M',
 'last_name': 'Duck'}

There is no guarantee about which document will be retrieved!

Note that JSON-encoded documents can be accessed using the dictionary syntax

In [13]:
duck['first_name']
Out[13]:
'Donald'
Wait: _id? ObjectId?
In [14]:
ducks.find_one()['_id']
Out[14]:
ObjectId('5b0c602be45bba0cb44ca461')

The `_id` field is automatically added to all documents; it contains a unique identifier to be used for indexing purposes. By default MongoDB sets it to an `ObjectId` instance

In [15]:
from bson.objectid import ObjectId
[ObjectId(), ObjectId(), ObjectId()]
Out[15]:
[ObjectId('5b0c6054e45bba0cb44ca466'),
 ObjectId('5b0c6054e45bba0cb44ca467'),
 ObjectId('5b0c6054e45bba0cb44ca468')]
Reading a collection II

Selection of several documents matching a query is done through the find method

In [16]:
r = ducks.find()
r
Out[16]:
<pymongo.cursor.Cursor at 0x7f53d8cbf080>

We will speak of cursors in a while, for the moment we just cast them to lists in order to see which documents are matched

In [17]:
list(ducks.find())
Out[17]:
[{'_id': ObjectId('5b0c602be45bba0cb44ca461'),
  'birth_year': 1920,
  'car': {'license_plate': 313, 'model': 'American Bantam'},
  'first_appearance': 1934,
  'first_name': 'Donald',
  'gender': 'M',
  'last_name': 'Duck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca462'),
  'birth_year': 1833,
  'car': {'model': 'Detroit Electric'},
  'first_appearance': 1943,
  'first_name': 'Grandma',
  'gender': 'F',
  'hobbies': ['cooking', 'gardening'],
  'last_name': 'Duck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca463'),
  'birth_year': 1867,
  'first_appearance': 1947,
  'first_name': 'Scrooge',
  'gender': 'M',
  'hobbies': ['finance', 'savings', 'swimming in money bins'],
  'last_name': 'McDuck'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca464'),
  'first_appearance': 1952,
  'first_name': 'Gyro',
  'gender': 'M',
  'hobbies': ['invention', 'study'],
  'last_name': 'Gearloose'},
 {'_id': ObjectId('5b0c604de45bba0cb44ca465'),
  'first_appearance': 1961,
  'first_name': 'Ludwig',
  'gender': 'M',
  'hobbies': ['study'],
  'last_name': 'Von Drake'}]
A more compact view of documents

Let's build a function providing a more compact visualization of ducks

In [18]:
def show_duck(d):
    try:
        return '{0} {1}'.format(d['first_name'], d['last_name'])
    except KeyError:
        return d['first_name']
    
show_duck(duck)
Out[18]:
'Donald Duck'
Reading a collection: simple selectors I

Based on the existence of specific fields

In [19]:
[show_duck(d) for d in ducks.find({'car': {'$exists': True}})]
Out[19]:
['Donald Duck', 'Grandma Duck']
Reading a collection: simple selectors II

Selecting specific values

In [20]:
[show_duck(d) for d in ducks.find({'gender': 'F'})]
Out[20]:
['Grandma Duck']
In [21]:
[show_duck(d) for d in ducks.find({'first_appearance': 1961})]
Out[21]:
['Ludwig Von Drake']
Reading a collection: simple selectors III

On the basis of a relation

In [22]:
[show_duck(d) for d in ducks.find({'first_appearance': {'$gt': 1950}})]
Out[22]:
['Gyro Gearloose', 'Ludwig Von Drake']

Filtering more than one value for a field

In [23]:
[show_duck(d) for d in ducks.find({'first_appearance': {'$in': [1947, 1961]}})]
Out[23]:
['Scrooge McDuck', 'Ludwig Von Drake']
Reading a collection: complex selectors I

Logical conjunction

In [24]:
[show_duck(d) for d in ducks.find({'first_appearance': {'$lt': 1950}, 
                                   'gender': 'M'})]
Out[24]:
['Donald Duck', 'Scrooge McDuck']

Logical disjunction

In [25]:
[show_duck(d) for d in ducks.find({'$or': [{'birth_year': {'$gt': 1900}}, 
                                           {'gender': 'F'}]})]
Out[25]:
['Donald Duck', 'Grandma Duck']
Reading a collection: complex selectors II

Array contents

In [26]:
[show_duck(d) for d in ducks.find({'hobbies': 'study'})]
Out[26]:
['Gyro Gearloose', 'Ludwig Von Drake']
In [27]:
[show_duck(d) for d in ducks.find({'hobbies': ['study']})]
Out[27]:
['Ludwig Von Drake']
In [28]:
[show_duck(d) for d in ducks.find({'hobbies.1': 'study'})]
Out[28]:
['Gyro Gearloose']
Reading a collection: complex selectors II

Embedded documents

In [29]:
[show_duck(d) for d in ducks.find({'car.model': 'American Bantam'})]
Out[29]:
['Donald Duck']
Reading collections: complex selectors III

Regular expressions

In [30]:
import re
regx = re.compile("uck$", re.IGNORECASE)

[show_duck(d) for d in ducks.find({'last_name': regx})]
Out[30]:
['Donald Duck', 'Grandma Duck', 'Scrooge McDuck']
Beware of flexibility

Using regular expressions, as well as `$lt` or `$gt`, may lead to inefficiency in query execution.

Projections I

Implemented via a second argument to `find`

  • either specifying fields to be shown...
In [31]:
list(ducks.find({'hobbies.1': 'study'},
                {'first_name': 1, 'last_name': 1}))
Out[31]:
[{'_id': ObjectId('5b0c604de45bba0cb44ca464'),
  'first_name': 'Gyro',
  'last_name': 'Gearloose'}]
Projections II
  • ...or fields to be excluded.
In [32]:
list(ducks.find({'hobbies.1': 'study'},
                {'_id': 0, 'gender': 0, 'birth_year': 0,
                 'first_appearance': 0, 'hobbies': 0}))
Out[32]:
[{'first_name': 'Gyro', 'last_name': 'Gearloose'}]
Projections III

  • The select/exclude options are mutually exclusive.
  • Only exception: the `_id` field can be filtered out:
In [33]:
list(ducks.find({'hobbies.1': 'study'},
                {'first_name': 1, 'last_name': 1, '_id': 0}))
Out[33]:
[{'first_name': 'Gyro', 'last_name': 'Gearloose'}]
Indicization I

Collections can be indexed specifying a set of fields and the corresponding sort order (the syntax used by is in this case different from the original API)

In [34]:
ducks.create_index([('first_name', pymongo.ASCENDING),
                    ('first_appearance', pymongo.DESCENDING)])
Out[34]:
'first_name_1_first_appearance_-1'
Indicization II

If a query is covered by an index, that is

  • all filtered fields,
  • and all displayed fields

are within the index, the query is executed via index traversal.

In [35]:
regx = re.compile("^G.*", re.IGNORECASE)
list(ducks.find({'first_name': regx, 'first_appearance': {'$lt': 1955}},
                {'_id': 0, 'first_name': 1, 'first_appearance': 1}))
Out[35]:
[{'first_appearance': 1943, 'first_name': 'Grandma'},
 {'first_appearance': 1952, 'first_name': 'Gyro'}]
Cursors

find returns a cursor to the query resultset, accessible via

  • invocation of the next method (until StopIteration)
In [36]:
cursor = ducks.find()
try:
    while True:
        print(show_duck(cursor.next()))
except StopIteration:
    pass
Donald Duck
Grandma Duck
Scrooge McDuck
Gyro Gearloose
Ludwig Von Drake
  • positional access through __getitem__ (aka [])
In [37]:
cursor = ducks.find()
show_duck(cursor[1])
Out[37]:
'Grandma Duck'
In [38]:
show_duck(cursor[4])
Out[38]:
'Ludwig Von Drake'

(even going back)

In [39]:
show_duck(cursor[2])
Out[39]:
'Scrooge McDuck'
  • list conversion, for-based iteration,
In [40]:
cursor = ducks.find()

for d in cursor:
    print(show_duck(d))
Donald Duck
Grandma Duck
Scrooge McDuck
Gyro Gearloose
Ludwig Von Drake
  • invocation of the skip, limit, sort, and count methods
In [41]:
cursor = ducks.find().sort('last_name', pymongo.DESCENDING)
for d in cursor:
    print(show_duck(d))
Ludwig Von Drake
Scrooge McDuck
Gyro Gearloose
Donald Duck
Grandma Duck
In [42]:
cursor.rewind()
Out[42]:
<pymongo.cursor.Cursor at 0x7f53d8cea0b8>
The update method is wicked!
In [43]:
ducks.insert_one({'first_name': 'Magica', 'last_name': 'De Spell',
                  'first_appearance': 1961})
ducks.update({'first_name': 'Magica'}, {'first_appearance': 2000})
/home/malchiodi/apps/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: DeprecationWarning: update is deprecated. Use replace_one, update_one or update_many instead.
  This is separate from the ipykernel package so we can avoid doing imports until
Out[43]:
{'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

Note a couple of things:

  • the used method is deprecated;
  • it returns a dictionary (instead of an object) summarizing the effects of the operation

Let's double-check:

In [44]:
ducks.find_one({'first_name': 'Magica'})
In [45]:
ducks.find_one({'first_appearance': 2000})
Out[45]:
{'_id': ObjectId('5b0c61c3e45bba0cb44ca469'), 'first_appearance': 2000}

This (deprecated) form of update completely replaced the document!

Solution

Use the update_one and update_many methods

In [46]:
# OK, let's revert things
ducks.find_one_and_delete({'first_appearance': 2000})
ducks.insert_one({'first_name': 'Magica', 'last_name': 'De Spell',
                  'first_appearance': 1961})

ducks.update_one({'first_name': 'Magica'}, {'first_appearance': 2000})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-6a0ad72fdd9c> in <module>()
      4                   'first_appearance': 1961})
      5 
----> 6 ducks.update_one({'first_name': 'Magica'}, {'first_appearance': 2000})

~/apps/anaconda3/lib/python3.6/site-packages/pymongo/collection.py in update_one(self, filter, update, upsert, bypass_document_validation, collation)
    887         """
    888         common.validate_is_mapping("filter", filter)
--> 889         common.validate_ok_for_update(update)
    890         with self._socket_for_writes() as sock_info:
    891             result = self._update(sock_info, filter, update, upsert,

~/apps/anaconda3/lib/python3.6/site-packages/pymongo/common.py in validate_ok_for_update(update)
    437     first = next(iter(update))
    438     if not first.startswith('$'):
--> 439         raise ValueError('update only works with $ operators')
    440 
    441 

ValueError: update only works with $ operators
WTF(irst_appearance)?
  • This error is due to the fact that these methods don't allow to implicitly replace a whole document
  • Instead, the use of $set, which actually updates one or more fields leaving the rest of the document untouched, is enforced
In [47]:
r = ducks.update_one({'first_name': 'Magica'}, {'$set': {'first_appearance': 2000}})
r
Out[47]:
<pymongo.results.UpdateResult at 0x7f53d8cd3c60>

The operation returns an object gathering information about the result:

In [48]:
# n. of matched documents
print(r.matched_count)
1
In [48]:
# n. of modified documents
r.modified_count
Out[48]:
1
In [49]:
# None if no upsertion, id of created document otherwise
# this deals with upsertions, will be explained later on
r.upserted_id
In [50]:
# results in the update-style
r.raw_result
Out[50]:
{'n': 1, 'nModified': 1, 'ok': 1.0, 'updatedExisting': True}

We can check that the method behaves as expected

In [37]:
ducks.find_one({'first_name': 'Magica'})
Out[37]:
{'_id': ObjectId('5b0b3d92e45bba1692b3bf4d'),
 'first_appearance': 2000,
 'first_name': 'Magica',
 'last_name': 'De Spell'}

`update_one` only updates one of the identified documents, `update_many` updates all of them.

Removing fields

There is also the possibility of removing fields in a document, using the `$unset` operator.

In [49]:
ducks.update_one({'first_name': 'Magica'}, {'$unset': {'first_appearance': ''}})
Out[49]:
<pymongo.results.UpdateResult at 0x7f53d8ced240>
In [50]:
ducks.find_one({'first_name': 'Magica'})
Out[50]:
{'_id': ObjectId('5b0c61cce45bba0cb44ca46a'),
 'first_name': 'Magica',
 'last_name': 'De Spell'}
Advanced update actions
  • `$inc` and `$dec` respectively increment and decrement a numeric quantity
  • `$push` adds elements to arrays
In [51]:
ducks.update_one({'first_name': 'Gyro'}, {'$push': {'hobbies': 'lamps'}})
ducks.find_one({'first_name': 'Gyro'})
Out[51]:
{'_id': ObjectId('5b0c604de45bba0cb44ca464'),
 'first_appearance': 1952,
 'first_name': 'Gyro',
 'gender': 'M',
 'hobbies': ['invention', 'study', 'lamps'],
 'last_name': 'Gearloose'}
  • `$pop` extracts the last or first element in an array
  • `$pull` removes all instances of an element from an array
The `$` operator

When dealing with arrays, `$` is a special operator denoting the index of the first occurrence of a value matched in an update query

In [52]:
ducks.update_one({'hobbies': 'lamps'}, {'$set': {'hobbies.$': 'robotic lamps'}})
ducks.find_one({'first_name': 'Gyro'})['hobbies']
Out[52]:
['invention', 'study', 'robotic lamps']
Advanced uses of `$`

Note that `$` can be used also in order to dig down to hierarchies. Suppose a document contains the following value for `hobbies`:

[{'name': 'invention', 'day': 'Monday'},
 {'name': 'study', 'day': 'Friday'}]

The update query modifies the day for the study hobby:

{{'hobbies.name': 'study'}, {'$set': {'hobbies.$.day': 'Tuesday'}}}
Upserts

What happens if one attempts to update a non-existing document?

  • nothing...
  • ...unless the `upsert=True` additional argument has been specified in `update_one` or `update_many`: in this case the operation results in an insert-upon-update (upsert).
Removing...
  • at most one document identified by a query
In [ ]:
ducks.delete_one({'first_name': 'Ludwig'})
  • all documents identified by a query
In [ ]:
ducks.delete_many({'first_name': 'Ludwig'})
  • all documents in a collection
In [ ]:
ducks.delete_many()
  • a collection
In [272]:
ducks.drop()
  • a database
In [ ]:
client.drop_database('duckburg')

# or

duckburg_db.command('dropDatabase')
Joins and scalability
  • Scalability in MongoDB is achieved through sharding, that is distributing and replicating data across several machines
  • This however doesn't allow to perform joins automatically
Handcrafted joins I

`ObjectId`s can be used to manually recreate one-to-many joins...

In [275]:
donald_id = ducks.find_one({'first_name': 'Donald'})['_id']
for d in ['Huey', 'Dewey', 'Louie']:
    ducks.insert_one({'first_name': d, 'uncle': donald_id})

...at the price of an additional query (to recover, in this case, the uncle).

Handcrafted joins II

Many-to-many joins can be implemented

  • using arrays
ducks.insert_one({'first_name': 'Donald', 'last_name': 'Duck',
                  'relatives': [ObjectId('516405b8b356cd6125b74e89'),
                                ObjectId('516405b8b356cd6125b74e8a'),
                                ObjectId('516405b8b356cd6125b74e8b')]})
ducks.find({'relatives': ObjectId('516405b8b356cd6125b74e8a')})
  • or embedded documents
ducks.insert_one({'first_name': 'Donald', 'last_name': 'Duck',
                  'relatives': [{'first_name': 'Huey'},
                                {'first_name': 'Dewey'},
                                {'first_name': 'Louie'}]})
ducks.find_one({'relatives.first_name': 'Dewey'})
Denormalization
  • a possible solution (duplication not to be necessarily demonized)
  • critical limit: the maximal size of each document is 16 MBytes
  • (aside note: the whole text of Shakespeare's Hamlet requires 200 Kbytes)
  • the actual price to be paid for this is in having a higher number of queries (for instance, when duplicate data are updated)
Final note

There's no universal solution: the choice depends on the problem under study, as well as on personal design style

There's much more
  • Sharding
  • Map-Reduce
  • Aggregation framework
  • ...
Your turn!
  • Download the notebook and dataset in a computer having mongodb and jupyter installed
  • Unzip the dataset and open the notebook
  • Solve the proposed challenges adding cells just below each question
  • Save your results in a notebook named using your name followed by "-mongodb"

The following cells contain some challenges: solve them and send the results to malchiodi@di.unimi.it, specifying Sophia Antipolis lecture assignment 1 as subject.

You can consult with classmates but what you send should be the exclusive result of your work, and in any case do not copy code.

Import the dataset

We will work with a book catalog already encoded in JSON format and available at the URL https://github.com/ozlerhakan/mongodb-json-files/blob/master/datasets/catalog.books.json

You don't need to download this dataset, as it is already in the Docker image we are using. You need however to import it into MongoDB: the next cell takes care of this task, creating a books collection within a db having the same name.

In [53]:
%%bash
mongoimport --db books --collection books --file catalog.books.json
2018-05-28T22:19:45.687+0200	connected to: localhost
2018-05-28T22:19:45.716+0200	imported 431 documents

Inspect data

Although we saw that a tabular description might not be the best resource to be used when visualizing noSQL data, we can build and show a Pandas dataframe as a quick way to peek into the dataset.

In [54]:
db = client.books
books = db.books

import pandas as pd
pd.DataFrame(list(books.find().limit(5)))
Out[54]:
_id authors categories isbn longDescription pageCount publishedDate shortDescription status thumbnailUrl title
0 2 [W. Frank Ableson, Robi Sen] [Java] 1935182722 When it comes to mobile apps, Android can do a... 592 2011-01-14 Android in Action, Second Edition is a compreh... PUBLISH https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Android in Action, Second Edition
1 3 [Gojko Adzic] [Software Engineering] 1617290084 NaN 0 2011-06-03 NaN PUBLISH https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Specification by Example
2 1 [W. Frank Ableson, Charlie Collins, Robi Sen] [Open Source, Mobile] 1933988673 Android is an open source mobile phone platfor... 416 2009-04-01 Unlocking Android: A Developer's Guide provide... PUBLISH https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Unlocking Android
3 4 [Tariq Ahmed with Jon Hirschi, Faisal Abid] [Internet] 1933988746 New web applications require engaging user-fri... 576 2009-02-02 NaN PUBLISH https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Flex 3 in Action
4 6 [Satnam Alag] [Internet] 1933988312 There's a great deal of wisdom in a crowd, but... 425 2008-10-01 NaN PUBLISH https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Collective Intelligence in Action
Assignments
  1. How many books are there?
  2. Which is the highest number of co-authors in a book?
  3. How many distinct authors are there?
  4. Which author has the highest number of coauthors?
  5. How many coauthors has this author and who are these coauthors?
  6. Deduplicate author
  7. Remove grouped authors
  8. Write (but don't execute!) a query removing books with null authors
  9. Which class is used in order to store time stamps?
  10. How many books were published in 2009?
  11. Which are the least and the most recent publication years?
  12. Display a bar plot showing the number of all published books between 1992 and 2015
  13. Build a function get_coauthors_for_years taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is None the search should be done on all books)
  14. Use the function get_coauthors_for_years in order to draw the collaboration graph of all books published between 1997 and 1999 (Hint: use the nextworkx.Graph class, specifically its add_node and add_edge methods, as well as the networkx.draw method in order to show a graph)
  15. Build and save the collaboration matrix of all authors
  16. Build and save the collaboration matrix of all books
  17. Build and save additional information

How many books are there?

Which is the highest number of co-authors in a book?

How many distinct authors are there?

Which author has the highest number of co-authors?

Note: empty string do not count as authors

How many coauthors has this author?

Who are these coauthors?

Deduplicate author

It seems our data are not that clean: one author is actually the empty string, Kimberly Tripp occurs twice (one time with the middle name, another one without it), and in two cases the author's name is actually the name for a group of authors. Let's clean up these things. First of all we can deduplicate records referring to Kimberly Tripp: write an update query modifying all occurrences of 'Kimberly L. Tripp' into 'Kimberly Tripp'

Remove grouped authors

Write an update query removing all occurrences of 'Contributions from 53 SQL Server MVPs' and Contributions from 53 SQL Server MVPs; Edited by Paul Nielsen as author names, replacing them with the empty string.

Write (but don't execute!) a query removing books with missing author

Write a query which would remove all documents having the empty string as author. Don't execute the query, as it will remove interesting information we will need later on.

Which class is used in order to store time stamps?

How many books were published in 2009?

Which are the least and the most recent publication years?

Display a bar plot showing the number of all published books between 1992 and 2015

Build a function get_coauthors_for_years taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is None the search should be done on all books)

Use the function get_coauthors_for_years in order to draw the collaboration graph of all books published between 1997 and 1999

The collaboration graph here is intended as a graph having authors as nodes and connecting two nodes if the corresponding authors have coauthored at least one book.

Hint: use the nextworkx.Graph class, specifically its add_node and add_edge methods, as well as the networkx.draw method in order to show a graph.

Build and save the collaboration matrix of all authors

Consider all books regardless of publication year, and build the authors' collaboration matrix, that is a square matrix whose dimension equals the number of authors. Entry $(r, c)$ in this matrix (where $r$ and $c$ refer respectively to rows and columns) should be 0 if $r$-th and $c$-th author never coauthored a book, otherwise it should be set to $\frac{1}{n_c}$, where $n_c$ is the total number of coauthors of $c$-th author.

We'll be interested actually only in non-null entries: save them as one row of a CSV file using the format $(r, c, m_{rc})$, being $m_{rc}$ the non-null element in row $r$ and column $c$. Use the name author-connections.csv for this file.

Hint: if you get confused about what to place in rows and columns, take in mind that the resulting matrix should be columns-wise stochastic, that is the sum of all elements in any columns has to be equal to 1.

Build and save the collaboration matrix of all books

The collaboration matrix of books is obtained in the same way as those of authors, where now two books are connected (and thus identify a non-null entry in the matrix) if they share at least one author. Also in this case, the obtained matrix should be column-wise stochastic.

Build and save additional information

Write in a CSV file the association of each book to the corresponding indices of rows and columns in the previously produced book connection matrix: each row of the file will have the format $(i,t)$, where $i$ is the index and $t$ is the title. Name this file 'book-index.csv'

Write a file 'book-categories.csv' in which each line has the format $(i, j)$, where $i$ is an index of the book (in the previously produced matrix and vector) and $j$ is 1 if the corresponding book has the Java category in the dataset, and 0 otherwise.