Day 1 Practice Problems

Practice the concepts you learned in Day 1

1) Data Science Pipeline

What practices are done in Structure Extraction? (Select all that apply)

Image Segmentation

Signal Processing

Querying/Processing

Regularization

Visualization/Presentation

Re-sampling

Cleaning

Outlier Removal

2) Tables

According to the principles of 'good' table design, why is high redundancy (repeating information) problematic?

It violates the principle of compactness and risks inconsistency

It prevents the use of primary keys in the table

It automatically changes the column types to strings

It makes the table impossible for tools like Pandas to ingest

If you have a many-to-many relationship, what is a good general approach?

Create a relationship table to eliminate redundancy

Add a reference column to the table

Create tables for each relationship

Do not allow for many-to-many relationships in your data
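As a rough illustration of what a relationship table looks like, the sketch below models users and groups as lists of dictionaries and stores each membership as its own row. The `users` and `memberships` tables, their column names, and the sample values are assumptions made up for this example; they are not part of the course material.

```python
# Hypothetical data: a user can join many groups, and a group can have many users.
users = [
    {"uid": 1, "name": "Ada"},
    {"uid": 2, "name": "Grace"},
]
groups = [
    {"gid": 101, "group_name": "Hikers"},
    {"gid": 102, "group_name": "Coders"},
]

# Relationship table (assumed name): one row per (user, group) pairing.
# User and group details are stored once, in their own tables.
memberships = [
    {"uid": 1, "gid": 101},
    {"uid": 1, "gid": 102},
    {"uid": 2, "gid": 102},
]

# Resolve the pairings back to readable names by looking up each key.
user_by_id = {u["uid"]: u["name"] for u in users}
group_by_id = {g["gid"]: g["group_name"] for g in groups}
pairs = [(user_by_id[m["uid"]], group_by_id[m["gid"]]) for m in memberships]
print(pairs)
```

Renaming a group in this layout touches a single row in `groups`, rather than every place the name would otherwise be repeated.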

Which relational algebra operation is used to filter out rows based on a specific predicate (condition)?

Join

Cross Product

Projection

Selection

        # Perform a projection on groups to retrieve only the list of group names.
        # Submit your answer as a Python iterable.
        groups = [
            {"gid": 101, "group_name": "Hikers"},
            {"gid": 102, "group_name": "Coders"}
        ]
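To make the difference between selection and projection concrete, here is a minimal sketch on a separate, hypothetical `people` table (its columns and values are invented for illustration). It treats a table as a list of dictionaries, which is only one of several reasonable ways to model rows in plain Python.

```python
# Hypothetical table: each dictionary is one row.
people = [
    {"pid": 1, "name": "Ada", "age": 36},
    {"pid": 2, "name": "Grace", "age": 45},
    {"pid": 3, "name": "Linus", "age": 28},
]

# Selection: keep whole rows that satisfy a predicate (here, age > 30).
over_30 = [row for row in people if row["age"] > 30]

# Projection: keep every row, but only the chosen column(s).
names = [row["name"] for row in people]

print(over_30)
print(names)
```

Selection changes how many rows come back; projection changes how many columns come back.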

3) SQL and Pandas

In SQL, what is the result of using a LEFT JOIN on Table A and Table B if a row in Table A has no matching record in Table B?

The database returns an error

The row from Table A is duplicated for every row in Table B

The row from Table A is discarded from the final result

The row from Table A is kept and Table B's columns are filled with 'NULL'
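One way to check LEFT JOIN behavior yourself is to run it against an in-memory SQLite database using Python's standard `sqlite3` module. The table names `a` and `b`, their columns, and the sample rows below are assumptions chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, label TEXT);
    CREATE TABLE b (a_id INTEGER, note TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y');
    INSERT INTO b VALUES (1, 'matched');  -- no row in b refers to a.id = 2
    """
)

rows = conn.execute(
    """
    SELECT a.id, a.label, b.note
    FROM a LEFT JOIN b ON a.id = b.a_id
    ORDER BY a.id
    """
).fetchall()

# SQL NULL is returned to Python as None.
print(rows)
conn.close()
```

The row with `id = 2` still appears in the result, with `None` in the column that comes from `b`.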

When performing an aggregate query in SQL, what is the functional difference between COUNT(*) and COUNT(column_name)?

COUNT(*) is only for primary keys, while COUNT(column_name) works for any column

COUNT(*) counts every row, while COUNT(column_name) ignores NULL values in that column

There is no difference

COUNT(*) returns the number of columns, while COUNT(column_name) returns the number of rows
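The two aggregates can be compared side by side on a small table that contains NULLs. The sketch below uses the standard `sqlite3` module; the `members` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?)",
    [
        ("Ada", "ada@example.com"),
        ("Grace", None),   # None is stored as SQL NULL
        ("Linus", None),
    ],
)

# Run both aggregates in one query so they see the same rows.
total_rows, rows_with_email = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM members"
).fetchone()

print(total_rows, rows_with_email)
conn.close()
```

Here the table has three rows, but only one of them has a non-NULL `email`.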

Which of the following describe an 'Analytics' workload as opposed to a 'Transaction' workload? (Select all that apply)

Requires a mechanism to prevent concurrent updates to the same data

Batch updates

Large scale queries

CRUD workloads

Focus on minimizing the amount of data we need to read

How does a B-Tree index improve the performance of a query looking for a specific value?

It compresses the data into a binary format that takes up less space

It automatically removes duplicate records from the table

It sorts the entire heap file every time a new record is inserted

It allows the database to find records in O(log N) time instead of scanning the whole file
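The idea behind an index lookup can be approximated in a few lines of Python. This sketch is not a B-Tree: it stands in a sorted list of keys plus binary search (via the standard `bisect` module) for the tree, which gives the same logarithmic lookup behavior on a small in-memory example. The `records` data and the helper names `scan_lookup` and `index_lookup` are assumptions for illustration.

```python
from bisect import bisect_left

# Hypothetical unsorted "heap file": (key, payload) records in insertion order.
records = [(k, f"row-{k}") for k in (42, 7, 19, 88, 3, 61)]

# Simplified index: sorted (key, position-in-heap) pairs.
index = sorted((key, pos) for pos, (key, _) in enumerate(records))
index_keys = [key for key, _ in index]


def scan_lookup(key):
    """Without an index: check records one by one, O(N)."""
    for rec in records:
        if rec[0] == key:
            return rec
    return None


def index_lookup(key):
    """With the index: binary search over sorted keys, O(log N)."""
    i = bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return records[index[i][1]]
    return None


print(scan_lookup(88), index_lookup(88))
```

Both functions return the same record; the difference is how many entries each has to examine as the table grows.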

In Pandas, what is the specific purpose of the iloc indexer compared to the loc indexer?

iloc can only access rows, while loc can access both rows and columns

iloc uses integer-based positional indexing, while loc uses label-based indexing

iloc is used for modifying data, while loc is only used for reading data

iloc is an older, deprecated version of loc that should be avoided
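The two indexers are easiest to tell apart on a DataFrame whose index labels are not 0, 1, 2. The sketch below assumes pandas is installed; the DataFrame contents (including the extra "Bakers" row) are made up for this example.

```python
import pandas as pd

# Hypothetical DataFrame whose index labels are group ids, not positions.
df = pd.DataFrame(
    {"group_name": ["Hikers", "Coders", "Bakers"]},
    index=[101, 102, 103],
)

# loc looks up by label: the row labeled 101, the column labeled "group_name".
by_label = df.loc[101, "group_name"]

# iloc looks up by integer position: first row, first column.
by_position = df.iloc[0, 0]

# Slices differ as well: loc includes the end label, iloc excludes the end position.
label_slice = df.loc[101:102]
position_slice = df.iloc[0:1]

print(by_label, by_position, len(label_slice), len(position_slice))
```

Note that `df.loc[0]` would raise a `KeyError` here, because no row carries the label 0, while `df.iloc[0]` returns the first row.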