The `_id` field is automatically added to all documents; it contains a unique identifier to be used for indexing purposes. By default MongoDB sets it to an `ObjectId` instance
Collections can be indexed specifying a set of fields and the corresponding sort order (the syntax used by is in this case different from the original API)
What happens if one attempts to update a non-existing document?
\n",
"\n",
"
\n",
"
nothing...
\n",
"
...unless the `upsert=True` additional argument has been specified in `update_one` or `update_many`: in this case the operation results in an insert-upon-update (upsert).
Copy the `http://localhost:8888?token=...` link output when the container starts
\n",
"
Select Import-and-test.ipynb
\n",
"
Either execute the code cells or copy and paste them into a new notebook
\n",
"
Remember that the container file system will be wiped out, thus regularly save it (File/Save and Checkpoint) and download it on the host machine (File/Download as/Notebook)\n",
"
\n",
"\n",
"
The following cells contain some challenges: solve them and send the results to malchiodi@di.unimi.it, specifying Sophia Antipolis lecture assignment 1 as subject.
\n",
"\n",
"
You can consult with classmates but what you send should be the exclusive result of your work, and in any case do not copy code.
"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Import the dataset ###\n",
"\n",
"We will work with a book catalog already encoded in JSON format and available at the URL https://github.com/ozlerhakan/mongodb-json-files/blob/master/datasets/catalog.books.json\n",
"\n",
"You don't need to download this dataset, as it is already in the Docker image we are using. You need however to import it into MongoDB: the next cell takes care of this task, creating a `books` collection within a db having the same name."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2018-05-31T17:27:12.299+0200\tconnected to: localhost\n",
"2018-05-31T17:27:12.327+0200\timported 431 documents\n"
]
}
],
"source": [
"%%bash\n",
"mongoimport --db books --collection books --file catalog.books.json"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Inspect data ###\n",
"\n",
"Although we saw that a tabular description might not be the best resource to be used when visualizing noSQL data, we can build and show a Pandas dataframe as a quick way to peek into the dataset."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
_id
\n",
"
authors
\n",
"
categories
\n",
"
isbn
\n",
"
longDescription
\n",
"
pageCount
\n",
"
publishedDate
\n",
"
shortDescription
\n",
"
status
\n",
"
thumbnailUrl
\n",
"
title
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
3
\n",
"
[Gojko Adzic]
\n",
"
[Software Engineering]
\n",
"
1617290084
\n",
"
NaN
\n",
"
0
\n",
"
2011-06-03
\n",
"
NaN
\n",
"
PUBLISH
\n",
"
https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....
\n",
"
Specification by Example
\n",
"
\n",
"
\n",
"
1
\n",
"
4
\n",
"
[Tariq Ahmed with Jon Hirschi, Faisal Abid]
\n",
"
[Internet]
\n",
"
1933988746
\n",
"
New web applications require engaging user-fri...
\n",
"
576
\n",
"
2009-02-02
\n",
"
NaN
\n",
"
PUBLISH
\n",
"
https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....
\n",
"
Flex 3 in Action
\n",
"
\n",
"
\n",
"
2
\n",
"
1
\n",
"
[W. Frank Ableson, Charlie Collins, Robi Sen]
\n",
"
[Open Source, Mobile]
\n",
"
1933988673
\n",
"
Android is an open source mobile phone platfor...
\n",
"
416
\n",
"
2009-04-01
\n",
"
Unlocking Android: A Developer's Guide provide...
\n",
"
PUBLISH
\n",
"
https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....
\n",
"
Unlocking Android
\n",
"
\n",
"
\n",
"
3
\n",
"
10
\n",
"
[Alexandre de Castro Alves]
\n",
"
[Java]
\n",
"
193518217X
\n",
"
A good application framework greatly simplifie...
\n",
"
325
\n",
"
2011-12-12
\n",
"
Enterprise OSGi shows a Java developer how to ...
\n",
"
PUBLISH
\n",
"
https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....
\n",
"
OSGi in Depth
\n",
"
\n",
"
\n",
"
4
\n",
"
5
\n",
"
[Tariq Ahmed, Dan Orlando, John C. Bland II, J...
\n",
"
[Internet]
\n",
"
1935182420
\n",
"
Using Flex, you can create high-quality, effec...
\n",
"
600
\n",
"
2010-11-15
\n",
"
NaN
\n",
"
PUBLISH
\n",
"
https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ....
\n",
"
Flex 4 in Action
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" _id authors \\\n",
"0 3 [Gojko Adzic] \n",
"1 4 [Tariq Ahmed with Jon Hirschi, Faisal Abid] \n",
"2 1 [W. Frank Ableson, Charlie Collins, Robi Sen] \n",
"3 10 [Alexandre de Castro Alves] \n",
"4 5 [Tariq Ahmed, Dan Orlando, John C. Bland II, J... \n",
"\n",
" categories isbn \\\n",
"0 [Software Engineering] 1617290084 \n",
"1 [Internet] 1933988746 \n",
"2 [Open Source, Mobile] 1933988673 \n",
"3 [Java] 193518217X \n",
"4 [Internet] 1935182420 \n",
"\n",
" longDescription pageCount publishedDate \\\n",
"0 NaN 0 2011-06-03 \n",
"1 New web applications require engaging user-fri... 576 2009-02-02 \n",
"2 Android is an open source mobile phone platfor... 416 2009-04-01 \n",
"3 A good application framework greatly simplifie... 325 2011-12-12 \n",
"4 Using Flex, you can create high-quality, effec... 600 2010-11-15 \n",
"\n",
" shortDescription status \\\n",
"0 NaN PUBLISH \n",
"1 NaN PUBLISH \n",
"2 Unlocking Android: A Developer's Guide provide... PUBLISH \n",
"3 Enterprise OSGi shows a Java developer how to ... PUBLISH \n",
"4 NaN PUBLISH \n",
"\n",
" thumbnailUrl title \n",
"0 https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Specification by Example \n",
"1 https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Flex 3 in Action \n",
"2 https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Unlocking Android \n",
"3 https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... OSGi in Depth \n",
"4 https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.... Flex 4 in Action "
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"db = client.books\n",
"books = db.books\n",
"\n",
"import pandas as pd\n",
"pd.DataFrame(list(books.find().limit(5)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"
Assignments
\n",
"\n",
"1. How many books are there?\n",
"2. Which is the highest number of co-authors in a book?\n",
"3. How many distinct authors are there?\n",
"4. Which author has the highest number of coauthors?\n",
"5. How many coauthors has this author and who are these coauthors?\n",
"6. Deduplicate author\n",
"7. Remove grouped authors\n",
"8. Write (but don't execute!) a query removing books with null authors\n",
"9. Which class is used in order to store time stamps?\n",
"10. How many books were published in 2009?\n",
"11. Which are the least and the most recent publication years?\n",
"12. Display a bar plot showing the number of all published books between 1992 and 2015\n",
"13. Build a function `get_coauthors_for_years` taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is `None` the search should be done on all books)\n",
"14. Use the function `get_coauthors_for_years` in order to draw the collaboration graph of all books published between 1997 and 1999 (Hint: use the `nextworkx.Graph` class, specifically its `add_node` and `add_edge` methods, as well as the `networkx.draw` method in order to show a graph)\n",
"15. Build and save the collaboration matrix of all authors\n",
"16. Build and save the collaboration matrix of all books\n",
"17. Build and save additional information"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### How many books are there? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Which is the highest number of co-authors in a book? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### How many distinct authors are there? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Which author has the highest number of co-authors? ###\n",
"Note: empty string do not count as authors"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### How many coauthors has this author? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Who are these coauthors? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Deduplicate author ###\n",
"\n",
"It seems our data are not that clean: one author is actually the empty string, Kimberly Tripp occurs twice (one time with the middle name, another one without it), and in two cases the author's name is actually the name for a group of authors. Let's clean up these things. First of all we can deduplicate records referring to Kimberly Tripp: write an update query modifying all occurrences of `'Kimberly L. Tripp'` into `'Kimberly Tripp'`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Remove grouped authors ###\n",
"\n",
"Write an update query removing all occurrences of `'Contributions from 53 SQL Server MVPs'` and `Contributions from 53 SQL Server MVPs; Edited by Paul Nielsen` as author names, replacing them with the empty string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Write (but don't execute!) a query removing books with missing author ###\n",
"\n",
"Write a query which would remove all documents having the empty string as author. Don't execute the query, as it will remove interesting information we will need later on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`books.delete_many({'authors': ''})`"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Which class is used in order to store time stamps? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### How many books were published in 2009? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Which are the least and the most recent publication years? ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Display a bar plot showing the number of all published books between 1992 and 2015 ###"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Build a function `get_coauthors_for_years` taking two years as arguments and returning a dictionary associating each author to the set of his coauthors within the corresponding time span (if any of the arguments is `None` the search should be done on all books) ### "
]
},
{
"cell_type": "code",
"execution_count": 479,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Use the function `get_coauthors_for_years` in order to draw the collaboration graph of all books published between 1997 and 1999 ###\n",
"\n",
"The collaboration graph here is intended as a graph having authors as nodes and connecting two nodes if the corresponding authors have coauthored at least one book.\n",
"\n",
"Hint: use the `nextworkx.Graph` class, specifically its `add_node` and `add_edge` methods, as well as the `networkx.draw` method in order to show a graph."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Build and save the collaboration matrix of all authors ###\n",
"\n",
"Consider all books regardless of publication year, and build the authors' collaboration matrix, that is a square matrix whose dimension equals the number of authors. Entry $(r, c)$ in this matrix (where $r$ and $c$ refer respectively to rows and columns) should be 0 if $r$-th and $c$-th author never coauthored a book, otherwise it should be set to $\\frac{1}{n_c}$, where $n_c$ is the total number of coauthors of $c$-th author.\n",
"\n",
"We'll be interested actually only in non-null entries: save them as one row of a CSV file using the format $(r, c, m_{rc})$, being $m_{rc}$ the non-null element in row $r$ and column $c$. Use the name `author-connections.csv` for this file.\n",
"\n",
"Hint: if you get confused about what to place in rows and columns, take in mind that the resulting matrix should be columns-wise stochastic, that is the sum of all elements in any columns has to be equal to 1."
]
},
{
"cell_type": "code",
"execution_count": 546,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Build and save the collaboration matrix of all books ###\n",
"\n",
"The collaboration matrix of books is obtained in the same way as those of authors, where now two books are connected (and thus identify a non-null entry in the matrix) if they share at least one author. Also in this case, the obtained matrix should be column-wise stochastic."
]
},
{
"cell_type": "code",
"execution_count": 628,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Build and save additional information ###\n",
"\n",
"Write in a CSV file the association of each book to the corresponding indices of rows and columns in the previously produced book connection matrix: each row of the file will have the format $(i,t)$, where $i$ is the index and $t$ is the title. Name this file `'book-index.csv'`"
]
},
{
"cell_type": "code",
"execution_count": 643,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Write a file `'book-categories.csv'` in which each line has the format $(i, j)$, where $i$ is an index of the book (in the previously produced matrix and vector) and $j$ is 1 if the corresponding book has the Java category in the dataset, and 0 otherwise."
]
},
{
"cell_type": "code",
"execution_count": 644,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"livereveal": {
"center": "false",
"theme": "white",
"transition": "none"
}
},
"nbformat": 4,
"nbformat_minor": 2
}