INFSCI 2711 Advanced Topics in Database Management

About

This is the website for the INFSCI 2711: Advanced Topics in Database Management class at School of Information Sciences, University of Pittsburgh.

Topics that are covered in this class:

Data Integration (OLAP and Data Warehousing, Virtual Data Integration)
Distributed and Parallel Databases (including distributed transactions and query execution)
NoSQL databases, NewSQL databases (Main Memory Databases)
Cluster Computing (Hadoop and other animals, Spark)

Project - MultiDBs

Basic Idea

We have a bunch of distributed heterogenous data sources (RDBMS, data files (excel, csv, dat, spss, etc)). We want to keep track of all sources, be able to keyword search throughout all of them, write complex SQL queries to combined data from several of them, be able to use/program complex data analysis algorithms and be able to visualize data in various forms and shapes.

One big project for the whole class. Each team work on independent components. Teams need to collaborate on the communication protocols between the components

Open Source

All code is open-source and hosted on github under infsci2711 organizaiton

Teams

MetaStore

A catalog to keep connection information to all data stores and some statistics about them.

GitHub page: MultiDBs-MetaStore GitHub

More info&docs: MultiDBs-MetaStore Docs

Project page: MultiDBs-MetaStore

Query

Utilize PrestoDB SQL engine to query data stores registered in the MetaStore catalog.

GitHub page: MultiDBs-Query GitHub

More info&docs: MultiDBs-Query Docs

Project page: MultiDBs-Query

Keyword Search Relational

Google like keyword search through all registered stores based on relational index approach.

GitHub page: MultiDBs-KeywordSearchR GitHub

More info&docs: MultiDBs-KeywordSearchR Docs

Project page: MultiDBs-KeywordSearchR

Keyword Search Graph I

The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.

GitHub page: MultiDBs-KeywordSearchG GitHub

More info&docs: MultiDBs-KeywordSearchG I Docs

Project page: MultiDBs-KeywordSearchG I

Keyword Search Graph II

The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.

GitHub page: MultiDBs-KeywordSearchG GitHub

More info&docs: MultiDBs-KeywordSearchG II Docs

Project page: MultiDBs-KeywordSearchG II

INotebook

Utilize IPython (IPython Notebook, Project Jupyter) tools on top of registered data stores.

GitHub page: MultiDBs-INotebook GitHub

More info&docs: MultiDBs-INotebook Docs

Project page: MultiDBs-INotebook

Files&APIs2DBs

Ingest data from different files/API format to a relational databases and expose those databases as data source that can be registered in the MetaStore.

GitHub page: MultiDBs-FilesAPIs2DBs GitHub

More info&docs: MultiDBs-FilesAPIs2DBs Docs

Project page: MultiDBs-FilesAPIs2DBs

Visualization

Visualize the data from the data sources registered in the MetaStore.

GitHub page: MultiDBs-Vis GitHub

More info&docs: MultiDBs-Vis Docs

Project page: MultiDBs-Vis

Icons made by Icons8 from www.flaticon.com is licensed by CC BY 3.0

Use Cases

Use Case #1 - Registering a datasource in MetaStore

Open MetaStore page, enter connection info to a MySQL database, title, description
MetaStore then should send REST request to PrestoDB and keyword search groups telling them that a new datasource was added
PrestoDB and keyword search groups need to take appropriate actions (e.g. create new PrestoDB catalog, update indexes, etc.)

Use Case #2 - Upload SPSS file and have it registered as a new datasource

Let user upload SPSS file and some metadata (e.g. title, description)
Process and store SPSS file as a new database
Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)

Use Case #3 - Get dataset from Dataverse via API and have it registered as a new datasource

Let user enter datasets identifier
Talk to Dataverse API and get dataset’s metadata and data and store it on local database
Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)

Use Case #4 - Querying with PrestoDB

Show user available datasources with their metadata (table names, column names) by getting this list from MetaStore project
Let user type SQL query and execute it

The query should not know about PrestoDB catalog names, but instead use MetaStore unique identifiers

Show paginated table result

Use Case #5 - Querying with PrestoDB and saving result as a new datasource

At first Use Case 4
Let user save the query result as a materialized database

Let user type database connection info where to save query results (the database might not exist)

for that create REST API to accept SQL query, Target DB info, Additional Info (title, description, etc.)

Once the query results are saved, send REST request to MetaStore to register new datasource (see previous use cases)

In the future can save SQL queries (basically as views)

Use Case #6 - REST API to run query

Use Case 4 should be implicitly implement this.
Provide REST API to accept a query as parameter and return result of this query as paginated table

Use Case #7 - Keyword search

Let user typo one or more keywords
Show list of datasets that contain those keywords

each result item should include datasource name (id), table name, column name where keyword(s) was(were) found

if it a merged dataset, repeat the above info for each datasource

Use Case #8 - Search Result: Show data as table

Based on Use Case 7
Add a button to each search result to show data as table

by clicking on the button, send SQL query via REST request to Presto project, it will execute the query and give you back the result that you need to show as simple HTML table.

Use Case #9 - Search Result: To Visualization page

Based on Use Case 7
Add a button to each one-dataset search result to visualize it

by clicking on the button, need to open Visualization page while also providing dataset id so visualization can automatically as a story to the new canvas
Visualization project should provide appropriate API (see use case Use Case 18)

Use Case #10 - Search Result: To INotebook

Based on Use Case 7
Add a button to each one-dataset search result to analyze it

by clicking on the button, need to open INotebook page while also providing dataset id so INotebook group can automatically create a notebook and import the database
INotebook project should provide appropriate API (see Use Case 13)

Use Case #11 - Search Result: To Save

Based on Use Case 7
Add a button to each multi-dataset search result (merged table) to save it as a new datasource

by clicking on the button, need to ask user for additional info (similar to the PrestoDB group Use Case 5
send REST request to PrestoDB group to save as new datasource (see Use Case 5)

Use Case #12 - INotebook

INotebook allow users to see available datasources (get them from MetaStore)
Allow to keyword search for specific datasource (by sending keyword query to one of the keyword search groups)
Allow users to create notebooks and use those datasources as data to do analyze on