About

This is the website for the INFSCI 2711: Advanced Topics in Database Management class at School of Information Sciences, University of Pittsburgh.

Topics that are covered in this class:

  • Data Integration (OLAP and Data Warehousing, Virtual Data Integration)
  • Distributed and Parallel Databases (including distributed transactions and query execution)
  • NoSQL databases, NewSQL databases (Main Memory Databases)
  • Cluster Computing (Hadoop and other animals, Spark)


Project - MultiDBs

Basic Idea

We have a bunch of distributed heterogenous data sources (RDBMS, data files (excel, csv, dat, spss, etc)). We want to keep track of all sources, be able to keyword search throughout all of them, write complex SQL queries to combined data from several of them, be able to use/program complex data analysis algorithms and be able to visualize data in various forms and shapes.

One big project for the whole class. Each team work on independent components. Teams need to collaborate on the communication protocols between the components

Open Source

All code is open-source and hosted on github under infsci2711 organizaiton

Teams

MetaStore

A catalog to keep connection information to all data stores and some statistics about them.

GitHub page: MultiDBs-MetaStore GitHub

More info&docs: MultiDBs-MetaStore Docs

Project page: MultiDBs-MetaStore

Query

Utilize PrestoDB SQL engine to query data stores registered in the MetaStore catalog.

GitHub page: MultiDBs-Query GitHub

More info&docs: MultiDBs-Query Docs

Project page: MultiDBs-Query

Keyword Search Relational

Google like keyword search through all registered stores based on relational index approach.

GitHub page: MultiDBs-KeywordSearchR GitHub

More info&docs: MultiDBs-KeywordSearchR Docs

Project page: MultiDBs-KeywordSearchR

Keyword Search Graph I

The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.

GitHub page: MultiDBs-KeywordSearchG GitHub

More info&docs: MultiDBs-KeywordSearchG I Docs

Project page: MultiDBs-KeywordSearchG I

Keyword Search Graph II

The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.

GitHub page: MultiDBs-KeywordSearchG GitHub

More info&docs: MultiDBs-KeywordSearchG II Docs

Project page: MultiDBs-KeywordSearchG II

INotebook

Utilize IPython (IPython Notebook, Project Jupyter) tools on top of registered data stores.

GitHub page: MultiDBs-INotebook GitHub

More info&docs: MultiDBs-INotebook Docs

Project page: MultiDBs-INotebook

Files&APIs2DBs

Ingest data from different files/API format to a relational databases and expose those databases as data source that can be registered in the MetaStore.

GitHub page: MultiDBs-FilesAPIs2DBs GitHub

More info&docs: MultiDBs-FilesAPIs2DBs Docs

Project page: MultiDBs-FilesAPIs2DBs

Visualization

Visualize the data from the data sources registered in the MetaStore.

GitHub page: MultiDBs-Vis GitHub

More info&docs: MultiDBs-Vis Docs

Project page: MultiDBs-Vis

Icons made by Icons8 from www.flaticon.com is licensed by CC BY 3.0

Use Cases

  • Open MetaStore page, enter connection info to a MySQL database, title, description
  • MetaStore then should send REST request to PrestoDB and keyword search groups telling them that a new datasource was added
  • PrestoDB and keyword search groups need to take appropriate actions (e.g. create new PrestoDB catalog, update indexes, etc.)
  • Let user upload SPSS file and some metadata (e.g. title, description)
  • Process and store SPSS file as a new database
  • Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
  • MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)
  • Let user enter datasets identifier
  • Talk to Dataverse API and get dataset’s metadata and data and store it on local database
  • Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
  • MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)
  • Show user available datasources with their metadata (table names, column names) by getting this list from MetaStore project
  • Let user type SQL query and execute it
    • The query should not know about PrestoDB catalog names, but instead use MetaStore unique identifiers
  • Show paginated table result
  • At first Use Case 4
  • Let user save the query result as a materialized database
    • Let user type database connection info where to save query results (the database might not exist)
      • for that create REST API to accept SQL query, Target DB info, Additional Info (title, description, etc.)
    • Once the query results are saved, send REST request to MetaStore to register new datasource (see previous use cases)
  • In the future can save SQL queries (basically as views)
  • Use Case 4 should be implicitly implement this.
  • Provide REST API to accept a query as parameter and return result of this query as paginated table
  • Let user typo one or more keywords
  • Show list of datasets that contain those keywords
    • each result item should include datasource name (id), table name, column name where keyword(s) was(were) found
      • if it a merged dataset, repeat the above info for each datasource
  • Based on Use Case 7
  • Add a button to each search result to show data as table
    • by clicking on the button, send SQL query via REST request to Presto project, it will execute the query and give you back the result that you need to show as simple HTML table.
  • Based on Use Case 7
  • Add a button to each one-dataset search result to visualize it
    • by clicking on the button, need to open Visualization page while also providing dataset id so visualization can automatically as a story to the new canvas
    • Visualization project should provide appropriate API (see use case Use Case 18)
  • Based on Use Case 7
  • Add a button to each one-dataset search result to analyze it
    • by clicking on the button, need to open INotebook page while also providing dataset id so INotebook group can automatically create a notebook and import the database
    • INotebook project should provide appropriate API (see Use Case 13)
  • Based on Use Case 7
  • Add a button to each multi-dataset search result (merged table) to save it as a new datasource
    • by clicking on the button, need to ask user for additional info (similar to the PrestoDB group Use Case 5
    • send REST request to PrestoDB group to save as new datasource (see Use Case 5)
  • INotebook allow users to see available datasources (get them from MetaStore)
  • Allow to keyword search for specific datasource (by sending keyword query to one of the keyword search groups)
  • Allow users to create notebooks and use those datasources as data to do analyze on
  • INotebook provide API to open INotebook page for a given datasource (given by datasource id), see Use Case 10
    • need to get connection info from MetaStore by using provided id.
  • INotebook to save modified datasets as new datasources
    • Ask user for additional info (similar to PrestoDB group)
      • target database connection info (where to save)
      • metadata: title, description
    • Save the data in the target DB
    • Register new DB as a datasource in MetaStore via REST API (similar to other Use Cases, e.g. Use Case 2)
  • Let users to manage canvases (create and delete)
  • Let users to add stories (datasources) to canvases
    • Get list of available datasources from MetaStore
    • Let users search for a datasource by a keyword search
      • send keyword search and use results of one of the keyword search group
  • Let users create different types of charts on a canvas
    • canvases should have similar functionality/look&feel as the old visualization
  • Create new canvas and add story automatically based on the given datasource id (See Use Case 9)
    • provide API that accepts datasource id

Lecture Notes

Coming soon


Students Presentations

Coming soon


Homework assignments

Coming soon


Blog Posts

subscribe via RSS