About
This is the website for the INFSCI 2711: Advanced Topics in Database Management class at School of Information Sciences, University of Pittsburgh.Topics that are covered in this class:
- Data Integration (OLAP and Data Warehousing, Virtual Data Integration)
- Distributed and Parallel Databases (including distributed transactions and query execution)
- NoSQL databases, NewSQL databases (Main Memory Databases)
- Cluster Computing (Hadoop and other animals, Spark)
Project - MultiDBs
Basic Idea
We have a bunch of distributed heterogenous data sources (RDBMS, data files (excel, csv, dat, spss, etc)). We want to keep track of all sources, be able to keyword search throughout all of them, write complex SQL queries to combined data from several of them, be able to use/program complex data analysis algorithms and be able to visualize data in various forms and shapes.
One big project for the whole class. Each team work on independent components. Teams need to collaborate on the communication protocols between the components
Open Source
All code is open-source and hosted on github under infsci2711 organizaiton
Teams
MetaStore
A catalog to keep connection information to all data stores and some statistics about them.
GitHub page: MultiDBs-MetaStore GitHub
More info&docs: MultiDBs-MetaStore Docs
Project page: MultiDBs-MetaStore
Query
Utilize PrestoDB SQL engine to query data stores registered in the MetaStore catalog.
GitHub page: MultiDBs-Query GitHub
More info&docs: MultiDBs-Query Docs
Project page: MultiDBs-Query
Keyword Search Relational
Google like keyword search through all registered stores based on relational index approach.
GitHub page: MultiDBs-KeywordSearchR GitHub
More info&docs: MultiDBs-KeywordSearchR Docs
Project page: MultiDBs-KeywordSearchR
Keyword Search Graph I
The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.
GitHub page: MultiDBs-KeywordSearchG GitHub
More info&docs: MultiDBs-KeywordSearchG I Docs
Project page: MultiDBs-KeywordSearchG I
Keyword Search Graph II
The same as Keyword Search Relational, but try to do it in a different way by using graph database (Neo4j) to maintain schema graph.
GitHub page: MultiDBs-KeywordSearchG GitHub
More info&docs: MultiDBs-KeywordSearchG II Docs
Project page: MultiDBs-KeywordSearchG II
INotebook
Utilize IPython (IPython Notebook, Project Jupyter) tools on top of registered data stores.
GitHub page: MultiDBs-INotebook GitHub
More info&docs: MultiDBs-INotebook Docs
Project page: MultiDBs-INotebook
Files&APIs2DBs
Ingest data from different files/API format to a relational databases and expose those databases as data source that can be registered in the MetaStore.
GitHub page: MultiDBs-FilesAPIs2DBs GitHub
More info&docs: MultiDBs-FilesAPIs2DBs Docs
Project page: MultiDBs-FilesAPIs2DBs
Visualization
Visualize the data from the data sources registered in the MetaStore.
GitHub page: MultiDBs-Vis GitHub
More info&docs: MultiDBs-Vis Docs
Project page: MultiDBs-Vis
Use Cases
- Open MetaStore page, enter connection info to a MySQL database, title, description
- MetaStore then should send REST request to PrestoDB and keyword search groups telling them that a new datasource was added
- PrestoDB and keyword search groups need to take appropriate actions (e.g. create new PrestoDB catalog, update indexes, etc.)
- Let user upload SPSS file and some metadata (e.g. title, description)
- Process and store SPSS file as a new database
- Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
- MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)
- Let user enter datasets identifier
- Talk to Dataverse API and get dataset’s metadata and data and store it on local database
- Send REST request to MetaStore to register the database as a new datasource (note the database physically will not be moved to MetaStore)
- MetaStore then do what it does after manual datasource registration (send REST requests to other related groups)
- Show user available datasources with their metadata (table names, column names) by getting this list from MetaStore project
- Let user type SQL query and execute it
- The query should not know about PrestoDB catalog names, but instead use MetaStore unique identifiers
- Show paginated table result
- At first Use Case 4
- Let user save the query result as a materialized database
- Let user type database connection info where to save query results (the database might not exist)
- for that create REST API to accept SQL query, Target DB info, Additional Info (title, description, etc.)
- Once the query results are saved, send REST request to MetaStore to register new datasource (see previous use cases)
- In the future can save SQL queries (basically as views)
- Use Case 4 should be implicitly implement this.
- Provide REST API to accept a query as parameter and return result of this query as paginated table
- Let user typo one or more keywords
- Show list of datasets that contain those keywords
- each result item should include datasource name (id), table name, column name where keyword(s) was(were) found
- if it a merged dataset, repeat the above info for each datasource
- Based on Use Case 7
- Add a button to each search result to show data as table
- by clicking on the button, send SQL query via REST request to Presto project, it will execute the query and give you back the result that you need to show as simple HTML table.
- Based on Use Case 7
- Add a button to each one-dataset search result to visualize it
- by clicking on the button, need to open Visualization page while also providing dataset id so visualization can automatically as a story to the new canvas
- Visualization project should provide appropriate API (see use case Use Case 18)
- Based on Use Case 7
- Add a button to each one-dataset search result to analyze it
- by clicking on the button, need to open INotebook page while also providing dataset id so INotebook group can automatically create a notebook and import the database
- INotebook project should provide appropriate API (see Use Case 13)
- Based on Use Case 7
- Add a button to each multi-dataset search result (merged table) to save it as a new datasource
- by clicking on the button, need to ask user for additional info (similar to the PrestoDB group Use Case 5
- send REST request to PrestoDB group to save as new datasource (see Use Case 5)
- INotebook allow users to see available datasources (get them from MetaStore)
- Allow to keyword search for specific datasource (by sending keyword query to one of the keyword search groups)
- Allow users to create notebooks and use those datasources as data to do analyze on
- INotebook provide API to open INotebook page for a given datasource (given by datasource id), see Use Case 10
- need to get connection info from MetaStore by using provided id.
- INotebook to save modified datasets as new datasources
- Ask user for additional info (similar to PrestoDB group)
- target database connection info (where to save)
- metadata: title, description
- Save the data in the target DB
- Register new DB as a datasource in MetaStore via REST API (similar to other Use Cases, e.g. Use Case 2)
- Let users to manage canvases (create and delete)
- Let users to add stories (datasources) to canvases
- Get list of available datasources from MetaStore
- Let users search for a datasource by a keyword search
- send keyword search and use results of one of the keyword search group
- Let users create different types of charts on a canvas
- canvases should have similar functionality/look&feel as the old visualization
- Create new canvas and add story automatically based on the given datasource id (See Use Case 9)
- provide API that accepts datasource id
Lecture Notes
Coming soon
Students Presentations
Coming soon
Homework assignments
Coming soon
Blog Posts
-
Tutorial on how to use JerseyClientUtil to communicate to other projects via REST API from java code
-
Tutorial on how to configure servers with external properties file
-
Tutorial on how to diploy projects to AWS Ubuntu servers
-
Tutorial on simple Web application with RESTful API on JAVA
subscribe via RSS