Views in Data Analysis with SQL

Nicolas Meseth
3 min readDec 7, 2020

--

In this article, I introduce SQL views to enable reuse of queries you often need. I use the Databricks Community Edition for most of my data analyses, and I refer to the platform in this article. Databricks offers interactive and multi-language notebooks that allow you to switch between SQL, Scala, Python, and R.

Why Use Views?

In data analysis with SQL, we often have the need to make changes to our source data and its columns. For example, we might want to convert a column in string format to a date format. To accomplish this, we could apply different SQL functions to the string column so that the correct format is obtained at the end.

In order not to have to repeat such changes in each subsequent query, which would make our queries more complicated and less readable, we can create so-called Views in SQL. A view is an SQL query that we assign a name. Using this name, we can then access the result of the query as if the view were a separate table.

Creating Views

The SQL command to create a permanent view is as follows:

CREATE OR REPLACE VIEW my_view AS
SELECT <columns>
FROM <table>

A permanent view is stored in the default database in Databricks and thus remains there even if you start a new cluster or log off the system. In contrast, there are also temporary views that exist only for the duration of the cluster session and disappear afterwards:

CREATE OR REPLACE TEMPORARY VIEW my_temp_view AS
SELECT <columns>
FROM <table>

In the end, it’s up to you whether you want to create a permanent or temporary view. If you’ll likely use a view in many notebooks, it is a good idea to create it as a permanent view, so you don’t have to jump into the notebook with the original view definition every time.

Instead of CREATE OR REPLACE VIEW you can also write CREATE VIEW. In the latter case, you will get an error message if a view with this name already exists.

Deleting Views

You can delete a view, whether permanent or temporary, at any time:

DROP VIEW my_view

Using Views

Once the view is created, you can use it in your queries like any other table:

SELECT * FROM my_view

When the query is executed, the SQL query of the view is first executed in the background. This means that with complex views and large datasets the queries can become slower because the view has to be queried again each time. In this case, I recommend you save the result of the view as a new table.

Persistence of Query Results

You can save the result of any SQL query as a new table instead of a view. This has advantages if the query runs for a long time and you use the result often in subsequent analyses.

To create a new table from a query, we have to switch to the Python language. Fortunately, this is easily done in Databricks:

%python
# Defines the SQL query for the new table
query = “SELECT * FROM my_view”
# Deletes a possibly already existing table
spark.sql(“drop table if exists my_view_as_table”)
# Executes the query and saves result to variable (dataframe)
df = spark.sql(query)
# Creates a new table based on the dataframe
df.write.saveAsTable(“my_view_as_table”)

After you run this code, you can query the new table my_view_as_table just like any other table.

That’s it! You now know the basics about views and how to apply them. Let me know on Twitter if you have any questions. You can reach me at @nicolasmeseth. If you found this article useful I’d very much appreciate any claps and a follow.

--

--

Nicolas Meseth

Father of two boys. Professor in Computer Science Education.