SQL

18 分钟阅读

SQL

https://linuxcommand.org/lc3_adv_sql.php

Okay kids, gird your grid for a big one.

The world as we know it is powered by data. Data, in turn, is stored in databases. Most everything we use computers for involves using data stored in some kind of database. When we talk about storing large amounts of data, we often mean relational database management systems (RDBMS). Banks, insurance companies, most every accounting system, and many, many websites use relational databases to store and organize their data.

The idea of relational data storage is generally credited to English computer scientist and IBM researcher E. F. Cobb, who proposed it in a paper in 1970. In the years that followed, a number of software companies built relational database systems with varying degrees of success. Around 1973, IBM developed a simplified and flexible method of managing relational databases called Structured Query Language (SQL, often pronounced “sequel”). Today the combination of RDBMS and SQL is a huge industry, generating many billions of dollars every year.

Relational databases are important to the history of Linux as well. It was the availability of open source database programs (such as MySQL) and web servers (most notably, Apache) that led to an explosion of Linux adoption in the early days of the world wide web.

In this adventure, we’re going to study the fundamentals of relational databases and use a popular command line program to create and manage a database of our own. The AWK adventure is a suggested prerequisite.

A Little Theory

Before we can delve into SQL we have to look at what relational databases are and how they work.

Tables

Simply put, a relational database is one or more tables containing columns and rows. Technically, the columns are known as attributes and the rows as tuples, but most often they are simply called columns and rows. In many ways, a table resembles the familiar spreadsheet. In fact, spreadsheet programs are often used to prepare and edit database tables. Each column contains data of a consistent type, for example, one column might consist of integers and another column strings. A table can contain any number of rows.

Schemas

The design of a database is called its schema and it can be simple, containing just a single table or two, or it can be complex, containing many tables with complex interrelationships between them.

Let’s imagine a database for a bookstore consisting of three tables. The first is called Customers, the second is called Items, and the third is called Sales. The Customers table will have multiple rows with each row containing information about one customer. The columns include a customer number, first and last names, and the customer’s address. Here is such a table with just some made-up names:

Cust First   Last     Street                City        ST
---- ------- -------- --------------------- ----------- --
0001 Richard Stollman 1 Outonthe Street     Boston      MA
0002 Eric    Roymond  2 Amendment Road      Bunker Hill PA
0003 Bruce   Porens   420 Middleville Drive Anytown     US

The Items table lists our books and contains the item number, title, author, and price.

Item Title                                  Author        Price
---- -------------------------------------- ------------- -----
1001 Winning Friends and Influencing People Dale Carnegie 14.95
1002 The Communist Manifesto                Marx & Engels 00.00
1003 Atlas Shrugged                         Ayn Rand      99.99

As we go about selling items in our imaginary bookstore, we generate rows in the Sales table. Each sale generates a row containing the customer number, date and time of the sale, the item number, the quantity sold, and the total amount of the sale.

Cust Date_Time    Item Quan Total
---- ------------ ---- ---- ------
0002 202006150931 1003    1  99.99
0001 202006151108 1002    1   0.00
0003 202006151820 1001   10 149.50

Keys

Now we might be wondering what the Cust and Item columns are for. They serve as keys. Keys are values that serve to uniquely identify a table row and to facilitate interrelationships between tables. Keys have some special properties. They must be both unique (that is, they can appear only once in a table and specifically identify a row) and they must also be immutable (they can never change). If they can’t meet these requirements, they cannot be keys. Some database implementations have methods of enforcing these requirements during table creation and keys can be formally specified. In the case of our bookstore database, the Cust column contains the keys for the Customers table and the Item column contains the keys for the Items table.

Knowing this about keys, we can now understand why the Sales table works the way it does. We can see for example that row 1 of the Sales table tells us that customer 0002 purchased 1 copy of item 1003 for $99.99. So why do we need special values for the keys? Why not, for instance, just use the customer’s name as the key? It’s because we can’t guarantee that the name won’t change, or that two people might have the same name. We can guarantee that an arbitrarily assigned value like our customer number is unique and immutable.

Database Engines/Servers

There are a number of database servers available for Linux. The two most prominent are MySQL (and its fork MariaDB) and PostgreSQL. These database servers implement client/server architecture, a concept that became a hot topic in the 1990s. Database servers of this type run as server processes and clients connect to them over a network connection and send SQL statements for the server to carry out. The results of the database operations are returned to the client program for further processing and presentation. Many web sites use this architecture with a web server sending SQL statements to a separate database server to dynamically create web pages as needed. The famous LAMP stack consisting of Linux, Apache web server, MySQL, and PHP powered much of the early web.

For purposes of this adventure, setting up database servers such as MySQL and PostgreSQL is too complicated to cover here since, among other things, they support multiple concurrent users and their attendant security controls. It’s more than we need for just learning SQL.

sqlite3

The database server we will be using is SQLite. SQLite is a library that can be used with applications to provide a high-level of SQL functionality. It’s very popular with the embedded systems crowd. It saves application developers the trouble of writing custom solutions to their data storage and management tasks. In addition to the library, SQLite provides a command line tool for directly interacting with SQLite databases. Also, since it accepts SQL from standard input (as well as it own command line interface) and sends its results to standard output, it’s ideal for use in our shell scripts.

SQLite is available from most Linux distribution repositories. Installation is easy, for example:

me@linuxbox:~$ sudo apt install sqlite3

Building a Playground

Let’s build a playground and play with some real data. On the LinuxCommand.org site there is a archive we can download that will do the trick.

me@linuxbox:~$ cd
me@linuxbox:~$ curl -c http://linuxcommand.org/adventure-sql.tgz
me@linuxbox:~$ tar xzf adventure-sql.tgz
me@linuxbox:~$ cd adventure-sql

Extracting the .tgz file will produce the playground directory containing the data sets, some demonstration code, and some helpful tools. The data sets we will be working with contain the listings of installed packages on an Ubuntu 18.04 system. This will include the name of packages, a short description of each one, a list of files contained in each package, and their sizes.

me@linuxbox:~/adventure-sql$ ls

All the files are human-readable text, so feel free to give them a look. The data sets in the archive are the .tsv files. These are tab-separated value files. The first one is the package_descriptions.tsv file. This contains two columns of data; a package name and a package description. The second file, named package_files.txv, has three columns: a package name, the name of a file installed by the package and the size of the file.

Starting sqlite3

To launch SQLite, we simply issue the command sqlite3 followed optionally by the name of a file to hold our database of tables. If we do not supply a file name, SQLite will create a temporary database in memory.

me@linuxbox:~/advemture-sql$ sqlite3
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>

When loading is complete, SQLite will present a prompt where we can enter commands. Commands can be either SQL statements or dot commands that are used to control SQLite itself. To see a list of the available dot commands, we enter .help at the prompt.

sqlite> .help

There are only a few of the dot commands that will be of interest to us and they deal mainly with how output is formatted. To exit sqlite3, we enter the dot command .quit at the prompt.

sqlite> .quit

Though we can interact directly with the sqlite3 program using the sqlite> prompt, sqlite3 can also accept streams of dot commands and SQL statements through standard input. This is how SQLite is most often used.

Creating a Table and Inserting Our Data

To get started with our database, we need to first convert our .tsv files into a stream of SQL statements. Our database will initially consist of two tables. The first is named Package_Descriptions and the second is named Package_Files. To create the SQL stream for the Package_Descriptions table we will use the insert_Package_Descriptions.awk program supplied in the playground archive.

me@linuxbox:~/advemture-sql$ ./insert_Package_Descriptions.awk \
< package_descriptions.tsv > insert_Package_Descriptions.sql

Let’s take a look at the resulting SQL stream. We’ll use the head command to display the first few lines of the stream.

me@linuxbox:~/advemture-sql$ head insert_Package_Descriptions.sql
DROP TABLE IF EXISTS Package_Descriptions;
CREATE TABLE Package_Descriptions (
    package_name VARCHAR(60),
    description  VARCHAR(120)
);
BEGIN TRANSACTION;
INSERT INTO Package_Descriptions
     VALUES ( 'a2ps',
              'GNU a2ps - ''Anything to PostScript'' converter and pretty-prin
ter');
INSERT INTO Package_Descriptions
     VALUES ( 'accountsservice',
              'query and manipulate user account information');

And the last few lines using the tail command.

me@linuxbox:~/advemture-sql$ tail insert_Package_Descriptions.sql
     VALUES ( 'zlib1g:amd64',
              'compression library - runtime');
INSERT INTO Package_Descriptions
     VALUES ( 'zlib1g:i386',
              'compression library - runtime');
INSERT INTO Package_Descriptions
     VALUES ( 'zlib1g-dev:amd64',
              'compression library - development');
INSERT INTO Package_Descriptions
     VALUES ( 'zsh',
              'shell with lots of features');
INSERT INTO Package_Descriptions
     VALUES ( 'zsh-common',
              'architecture independent files for Zsh');
COMMIT;

As we can see, SQL is verbose and somewhat English-like. Convention dictates that language keywords be in uppercase; however it is not required. SQL is case insensitive. White space is not important, but is often used to make the SQL statements easier to read. Statements can span multiple lines but don’t have to. Statements are terminated by a semicolon character. The SQL in this adventure is generally formatted in accordance with the style guide written by Simon Holywell linked in the “Further Reading” section below. Since some SQL can get quite complicated, visual neatness counts when writing code.

SQL supports two forms of commenting.

-- Single line comments are preceeded by 2 dashes

/* And multi-line comments are done in the
   style of the C programming language */

Before we go on, we need to digress for a moment to discuss SQL as a standard. While there are ANSI standards for SQL, every database server implements SQL differently. Each one has a slightly different dialect. The reason for this is partly historical; in the early days there weren’t any firm standards, and partly commercial. Each database vendor wanted to make it hard for customers to migrate to competing products so each vendor added unique extensions and features to the language to promote the dreaded “vendor lock-in” for their product. SQLite supports most of standard SQL (but not all of it) and it adds a few unique features.

Creating and Deleting Tables

The first 2 lines of our SQL stream deletes any existing table named Package_Descriptions and creates a new table with that name. The DROP TABLE statement deletes a table. The optional IF EXISTS clause is used to prevent errors if the table does not already exist. There are a lot of optional clauses in SQL. The CREATE TABLE statement defines a new table. As we can see, this table will have 2 columns. The first column, package_name is defined to be a variable length string up to 60 characters long. VARCHAR is one of the available data types we can define. Here are some of the common data types supported by SQL databases:

Data Type	Description
INTEGER	Integer
CHAR(n)	Fixed length string
VARCHAR(n)	Variable length string
NUMERIC	Decimal numbers
REAL	Floating point numbers
DATETIME	Date and time values

Data Types

SQL databases support many types of data. Unfortunately, this varies by vendor. Even in cases where two databases share the same data type name, the actual meaning of the data type can differ. Data types in SQLite, unlike other SQL databases, are not rigidly fixed. Values in SQLite are dynamically typed. While SQLite allows many of the common data types found in other databases to be specified, it actually only supports 4 general types of data storage.

Data Type	Description
INTEGER	Signed integers using 1, 2, 3, 4, 6, or 8 bytes as needed
REAL	8-byte IEEE floating point numbers
TEXT	Text strings
BLOB	Binary large objects (for example JPEG, or MP3 files)

In our example above, we specified VARCHAR as the data type for our columns. SQLite is perfectly happy with this, but it actually stores the values in these columns as just TEXT. It ignores the length restrictions set in the data type specification. SQLite is extremely lenient about data types. In fact, it allows any kind of data to be stored in any specified data type, even allowing a mixture of data types in a single column. This is completely incompatible with all other databases, and relying on this would be very bad practice. In the remainder of this adventure we will be sticking to conventional data types and behavior.

Inserting Data

Moving on with our SQL stream, we see that the majority of the stream consists of INSERT statements. This is how rows are added to a table. Insert is sort of a misnomer as INSERT statements append rows to a table.

We surround the INSERT statements with BEGIN TRANSACTION and COMMIT. This is done for performance reasons. If we leave these out, the rows will still be appended to the table but each INSERT will be treated as a separate transaction, vastly increasing the amount of time it takes to append a large number of rows. Treating a transaction this way also has another important benefit. SQL does not apply the transaction to the database until it receives the COMMIT statement, thus it is possible to write SQL code that will abandon a transaction if there is a problem and the change will be rolled back leaving the database unchanged.

Let’s go ahead and create our first table and add the package names and descriptions.

me@linuxbox:~/advemture-sql$ sqlite3 adv-sql.sqlite \
< insert_Package_Descriptions.sql

We execute the sqlite3 program specifying adv-sql.sqlite as the file used to store our tables. The choice of file name is arbitrary. We read our SQL stream into standard input and sqlite3 carries out the statements.

Doing Some Queries

Now that we have a database (albeit a small one), let’s take a look at it. To do this, we will start up sqlite3 and interact with it at the prompt.

me@linuxbox:~/advemture-sql$ sqlite3 adv-sql.sqlite
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite>

We’ll first use some SQLite dot commands to examine the structure of the database.

sqlite> .tables
Package_Descriptions
sqlite> .schema Package_Descriptions
CREATE TABLE Package_Descriptions (
    package_name VARCHAR(60),
    description  VARCHAR(120)
);

The .tables dot command displays a list of tables in the database while the .schema dot command lists the statements used to create the specified table.

Next, we’ll get into some real SQL using SELECT, probably the most frequently used SQL statement.

sqlite> SELECT * FROM Package_Descriptions;
a2ps|GNU a2ps - 'Anything to PostScript' converter and pretty-printer
accountsservice|query and manipulate user account information
acl|Access control list utilities
acpi-support|scripts for handling many ACPI events
acpid|Advanced Configuration and Power Interface event daemon
adduser|add and remove users and groups
adium-theme-ubuntu|Adium message style for Ubuntu
adwaita-icon-theme|default icon theme of GNOME (small subset)
aisleriot|GNOME solitaire card game collection
alsa-base|ALSA driver configuration files
.
.
.

This is the simplest form of the SELECT statement. The syntax is the word SELECT followed by a list of columns or calculated values we want, and a FROM clause specifying the source of the data. This example uses * which means every column. Alternately, we could explicitly name the columns, like so:

sqlite> SELECT package_name, description FROM Package_Descriptions;

And achieve the same result.

Controlling the Output

As we can see from the output above, the default format is fine for further processing by tools such as awk, but it leaves a lot to be desired when it comes to being read by humans. We can adjust the output format with some dot commands. We’ll also add the LIMIT clause to the end of our query to output just 10 rows.

sqlite> .headers on
sqlite> .mode column
sqlite> SELECT * FROM Package_Descriptions LIMIT 10;
package_name  description                                                     
------------  ----------------------------------------------------------------
a2ps          GNU a2ps - 'Anything to PostScript' converter and pretty-printer
accountsserv  query and manipulate user account information                   
acl           Access control list utilities                                   
acpi-support  scripts for handling many ACPI events                           
acpid         Advanced Configuration and Power Interface event daemon         
adduser       add and remove users and groups                                 
adium-theme-  Adium message style for Ubuntu                                  
adwaita-icon  default icon theme of GNOME (small subset)                      
aisleriot     GNOME solitaire card game collection                            
alsa-base     ALSA driver configuration files

By using the .headers on and .mode column dot commands, we add the column headings and change the output to column format. These settings stay in effect until we change them. The .mode dot command has a number of interesting possible settings.

Mode Description

csv Comma-separated values

column Left-aligned columns. Use .width n1 n2… to set column widths.

html

HTMLcode

insert	SQL insert statements for TABLE
line	One value per line
list	Values delimited by .separator string. This is the default.
tabs	Tab-separated values
tcl	TCL (Tool Control Language) list elements

Here we will set the mode and column widths for our table.

sqlite> .mode column 
sqlite> .width 20 60 
sqlite> SELECT * FROM Package_Descriptions LIMIT 10;
package_name          description                                                 
--------------------  ---------------------------------------------------------
a2ps                  GNU a2ps - 'Anything to PostScript' converter and pretty-
accountsservice       query and manipulate user account information               
acl                   Access control list utilities                               
acpi-support          scripts for handling many ACPI events                       
acpid                 Advanced Configuration and Power Interface event daemon     
adduser               add and remove users and groups                             
adium-theme-ubuntu    Adium message style for Ubuntu                              
adwaita-icon-theme    default icon theme of GNOME (small subset)                  
aisleriot             GNOME solitaire card game collection                        
alsa-base             ALSA driver configuration files

In addition to listing columns, SELECT can be used to perform various output tasks. For example, we can perform calculations such as counting the number of rows in the Package_Descriptions table.

sqlite> SELECT COUNT(package_name) FROM Package_Descriptions;
count(package_name) 
--------------------
1972

Being Selective

We can make SELECT output rows based on some selection criteria. Either an exact match:

sqlite> SELECT * FROM Package_Descriptions WHERE package_name = 'bash';
package_name          description                                                 
--------------------  --------------------------------------------------------
bash                  GNU Bourne Again SHell

A partial match using SQL wildcard characters:

sqlite> SELECT * FROM Package_Descriptions WHERE description LIKE '%bash%';
package_name          description
--------------------  --------------------------------------------------------
bash-completion       programmable completion for the bash shell
command-not-found     Suggest installation of packages in interactive bash ses

SQL supports two wildcard characters. The underscore (_), which matches any single character, and the percent sign (%), which matches zero or more instances of any character.

Notice too that strings are surrounded with single quotes. If a value is quoted this way, SQL treats it as a string. For example, the value '123' would be treated as a string rather than a number.

Sorting Output

Unless we tell SELECT to sort our data, it will be listed in the order it was inserted into the table. Our data appears in alphabetical order by package name because it was inserted that way, not because of anything SQLite is doing. The ORDER BY clause can be added to determine which column is used for sorting. To demonstrate, let’s sort the output by the description,

sqlite> SELECT * FROM Package_Descriptions ORDER BY description LIMIT 10;
package_name          description
--------------------  -------------------------------------------------------
network-manager-conf
fonts-noto-cjk        "No Tofu" font families with large Unicode coverage (CJ
fonts-noto-mono       "No Tofu" monospaced font family with large Unicode cov
udev                  /dev/ and hotplug management daemon
procps                /proc file system utilities
alsa-base             ALSA driver configuration files
libasound2-plugins:a  ALSA library additional plugins
libhyphen0:amd64      ALTLinux hyphenation library - shared library
apcupsd               APC UPS Power Management (daemon)
apcupsd-doc           APC UPS Power Management (documentation/examples)

The default sorting order is ascending, but we can also sort in descending order by including DESC after the column name. Multiple columns can be named and ASC can be used to specify ascending order.

sqlite> SELECT * FROM Package_Descriptions ORDER BY description DESC LIMIT 10;
package_name  description                                       
------------  --------------------------------------------------
xsane-common  xsane architecture independent files              
libx264-152:  x264 video coding library                         
libevdev2:am  wrapper library for evdev devices                 
wireless-reg  wireless regulatory database                      
crda          wireless Central Regulatory Domain Agent          
libmutter-2-  window manager library from the Mutter window mana
libwayland-s  wayland compositor infrastructure - server library
libwayland-c  wayland compositor infrastructure - cursor library
libwayland-c  wayland compositor infrastructure - client library
libwayland-e  wayland compositor infrastructure - EGL library

Adding Another Table

To demonstrate more of what we can do with SELECT, we’re going to need a bigger database. We have a second .tsv file that we can add. To save a step, we’ll filter the file into SQL and pipe it directly into sqlite3.

me@linuxbox:~/advemture-sql$ ./insert_package_files.awk \ 
                                < package_files-deb.tsv \
                                | sqlite3 adv-sql.sqlite

The second table is named Package_Files. Here is its schema:

sqlite> .schema Package_Files
CREATE TABLE Package_Files (
    package_name VARCHAR(60),
    file         VARCHAR(120),
    size_bytes   INTEGER
);

As we can see, this table has 3 columns; the package name, the name of a file installed by the package, and the size of the installed file in bytes. Let’s do a SELECT to see how this table works.

sqlite> .headers on
sqlite> .mode column
sqlite> .width 15 50 -10
sqlite> SELECT * FROM Package_Files WHERE package_name = 'bash';
package_name     file                                                size_bytes
---------------  -------------------------------------------------  ----------
bash             /bin/bash                                             1113504
bash             /etc/bash.bashrc                                         2319
bash             /etc/skel/.bash_logout                                    220
bash             /etc/skel/.bashrc                                        3771
bash             /etc/skel/.profile                                        807
bash             /usr/bin/bashbug                                         7115
bash             /usr/bin/clear_console                                  10312
bash             /usr/share/doc/bash/COMPAT.gz                            7853
bash             /usr/share/doc/bash/INTRO.gz                             2921
bash             /usr/share/doc/bash/NEWS.gz                             27983
bash             /usr/share/doc/bash/POSIX.gz                             3702
bash             /usr/share/doc/bash/RBASH                                1693
bash             /usr/share/doc/bash/README                               3839
bash             /usr/share/doc/bash/README.Debian.gz                     1919
bash             /usr/share/doc/bash/README.abs-guide                     1105
bash             /usr/share/doc/bash/README.commands.gz                   3021
bash             /usr/share/doc/bash/changelog.Debian.gz                  1357
bash             /usr/share/doc/bash/copyright                           10231
bash             /usr/share/doc/bash/inputrc.arrows                        727
bash             /usr/share/lintian/overrides/bash                         156
bash             /usr/share/man/man1/bash.1.gz                           86656
bash             /usr/share/man/man1/bashbug.1.gz                          804
bash             /usr/share/man/man1/clear_console.1.gz                   1194
bash             /usr/share/man/man1/rbash.1.gz                            154
bash             /usr/share/man/man7/bash-builtins.7.gz                    508
bash             /usr/share/menu/bash                                      194
bash             /bin/rbash                                                  4

Notice the .width dot command above. A negative value causes the corresponding column to be right-aligned.

Subqueries

The SELECT statement can be used to produce many kinds of output. For example, it can be used to simply print literal strings.

sqlite> .mode column
sqlite> .header off
sqlite> SELECT 'String 1', 'String 2';
string 1    string 2

As we saw before, SELECT can produce calculated values.

sqlite> .header on
sqlite> SELECT 2 + 2;
2 + 2
--------------------
4
sqlite> SELECT COUNT(file), AVG(size_bytes) FROM Package_Files;
count(file)  avg(size_bytes)
-----------  ----------------
153506       33370.3488658424

To make complicated expressions more readable, we can assign their results to aliases by using the AS clause.

sqlite> SELECT COUNT(file) AS Files,
   ...> AVG(size_bytes) AS 'Average Size'
   ...> FROM Package_Files;
Files       Average Size    
----------  ----------------
153506      33370.3488658424

An important feature of SELECT is the ability to produce results by combining data from multiple tables. This process is done by performing joins and subqueries. We’ll talk about joins a little later, but for now let’s concentrate on subqueries. SELECT allows us to include another SELECT statement as an item to output. To demonstrate this, we will produce a table that includes columns for package name, number of files in the package, and the total size of the package. The SELECT statement to do this is rather formidable. We’ll open our text editor and create a file named subquery_demo1.sql with the following content:

-- subquery_demo1.sql

-- Query to list top 20 packages with the greatest numbers of files

.mode column
.header on
.width 20 40 10 10
SELECT package_name, description,
       (SELECT COUNT(file)
          FROM Package_Files
         WHERE Package_Descriptions.package_name = Package_Files.package_name)
            AS files,
       (SELECT SUM(size_bytes)
          FROM Package_Files
         WHERE Package_Descriptions.package_name = Package_Files.package_name)
            AS size
  FROM Package_Descriptions ORDER BY files DESC LIMIT 20;

We’ll next run this query and view the results.

me@linuxbox:~/adventure-sql$ sqlite3 adv-sql.sqlite < subquery_demo1.sql
package_name          description                       files       size      
--------------------  --------------------------------  ----------  ----------
linux-headers-4.15.0  Header files related to Linux ke  14849       63991787  
linux-headers-4.15.0  Header files related to Linux ke  14849       64001943  
humanity-icon-theme   Humanity Icon theme               8014        14213715  
linux-headers-4.15.0  Linux kernel headers for version  7861        9015084   
linux-headers-4.15.0  Linux kernel headers for version  7860        9025673   
linux-modules-extra-  Linux kernel extra modules for v  4173        165921470 
linux-modules-extra-  Linux kernel extra modules for v  4172        165884678 
libreoffice-common    office productivity suite -- arc  3551        76686149  
gnome-accessibility-  High Contrast GTK+ 2 theme and i  3464        3713621   
ubuntu-mono           Ubuntu Mono Icon theme            3025        3755093   
ncurses-term          additional terminal type definit  2727        1987483   
manpages-dev          Manual pages about using GNU/Lin  2101        2192620   
linux-firmware        Firmware for Linux kernel driver  1938        331497257 
tzdata                time zone and daylight-saving ti  1834        1210058   
vim-runtime           Vi IMproved - Runtime files       1642        27941732  
codium                Code editing. Redefined.          1307        271907088 
zsh-common            architecture independent files f  1256        12261077  
perl-modules-5.26     Core Perl modules                 1144        18015966  
adwaita-icon-theme    default icon theme of GNOME (sma  1127        4848678   
gimp-data             Data files for GIMP               1032        45011675

The query takes some time to run (it has a lot to do) and from the results we see that it produces 4 columns: package name, description, number of files in the package, and total size of the package. Let’s take this query apart and see how it works. At the uppermost level we see that the query follows the normal pattern of a SELECT statement.

SELECT list_of_items FROM Package_Descriptions
ORDER BY total_files DESC
LIMIT 20;

The basic structure is simple. What’s interesting is the list_of_items part. We know the list of items is a comma-separated list of items to output, so if we follow the commas we can see the list:

package_name
description
(SELECT COUNT(file) FROM Package_Files WHERE Package_Descriptions.package_name = Package_Files.package_name) AS files
(SELECT SUM(size_bytes) FROM Package_Files WHERE Package_Descriptions.package_name = Package_Files.package_name) AS size

It’s also possible to use a subquery in a WHERE clause. Consider this query that we will name subquery_demo2.sql:

-- subquery_demo2.sql

-- Query to list all packages containing more than 1000 files

.mode column
.header on
.width 20 60
SELECT package_name, description
  FROM Package_Descriptions
 WHERE 1000 < (SELECT COUNT(file)
          FROM Package_Files
         WHERE Package_Descriptions.package_name = Package_Files.package_name)
 ORDER BY package_name;

When we execute this, we get the following results:

me@linuxbox:~/adventure-sql$ sqlite3 adv-sql.sqlite < subquery_demo2.sql
package_name          description                                                 
--------------------  --------------------------------------------------------
adwaita-icon-theme    default icon theme of GNOME (small subset)              
codium                Code editing. Redefined.                                
gimp-data             Data files for GIMP                                     
gnome-accessibility-  High Contrast GTK+ 2 theme and icons                    
humanity-icon-theme   Humanity Icon theme                                     
inkscape              vector-based drawing program                            
libreoffice-common    office productivity suite -- arch-independent files     
linux-firmware        Firmware for Linux kernel drivers                       
linux-headers-4.15.0  Header files related to Linux kernel version 4.15.0     
linux-headers-4.15.0  Linux kernel headers for version 4.15.0 on 64 bit x86 SM
linux-headers-4.15.0  Header files related to Linux kernel version 4.15.0     
linux-headers-4.15.0  Linux kernel headers for version 4.15.0 on 64 bit x86 SM
linux-modules-4.15.0  Linux kernel extra modules for version 4.15.0 on 64 bit 
linux-modules-4.15.0  Linux kernel extra modules for version 4.15.0 on 64 bit 
linux-modules-extra-  Linux kernel extra modules for version 4.15.0 on 64 bit 
linux-modules-extra-  Linux kernel extra modules for version 4.15.0 on 64 bit 
manpages-dev          Manual pages about using GNU/Linux for development      
ncurses-term          additional terminal type definitions                     
perl-modules-5.26     Core Perl modules                                        
tzdata                time zone and daylight-saving time data                  
ubuntu-mono           Ubuntu Mono Icon theme                                   
vim-runtime           Vi IMproved - Runtime files                              
zsh-common            architecture independent files for Zsh

Updating Tables

The UPDATE statement is used to change values in one or more existing rows. We will demonstrate this by adding 100 to the size of each file in the sqlite3 package. First, let’s look at the files in the package.

sqlite> .mode column
sqlite> .header on
sqlite> .width 50 -10
sqlite> SELECT file, size_bytes FROM Package_Files
   ...> WHERE package_name = 'sqlite3';
file                                                size_bytes
--------------------------------------------------  ----------
/usr/bin/sqldiff                                       1103280
/usr/bin/sqlite3                                       1260976
/usr/share/doc/sqlite3/copyright                          1261
/usr/share/man/man1/sqlite3.1.gz                          3596
/usr/share/doc/sqlite3/changelog.Debian.gz                  35

Next, we’ll update the table, adding 100 to the size of each file.

sqlite> UPDATE Package_Files SET size_bytes = size_bytes + 100
   ...> WHERE package_name = 'sqlite3';

When we examine the rows now, we see the change.

sqlite> SELECT file, size_bytes FROM Package_Files
   ...> WHERE package_name = 'sqlite3';
file                                                size_bytes
--------------------------------------------------  ----------
/usr/bin/sqldiff                                       1103380
/usr/bin/sqlite3                                       1261076
/usr/share/doc/sqlite3/copyright                          1361
/usr/share/man/man1/sqlite3.1.gz                          3696
/usr/share/doc/sqlite3/changelog.Debian.gz                 135

Finally, we’ll subtract 100 from each row to return the sizes to their original values.

sqlite> UPDATE Package_Files SET size_bytes = size_bytes - 100
   ...> WHERE package_name = 'sqlite3';

UPDATE can modify multiple values at once. To demonstrate this, we will create a new table called Package_Stats and use UPDATE to fill in the values. Since this one is a little complicated, we will put this in a file named create_Package_Stats.sql.

-- create_Package_Stats.sql

DROP TABLE IF EXISTS Package_Stats;
CREATE TABLE Package_Stats (
    package_name  VARCHAR(60),
    count         INTEGER,
    tot_size      INTEGER,
    min_size      INTEGER,
    max_size      INTEGER,
    avg_size      REAL
);

INSERT INTO Package_Stats (package_name)
     SELECT package_name
       FROM Package_Descriptions;

UPDATE Package_Stats
   SET count = (SELECT COUNT(file)
                  FROM Package_Files
                 WHERE Package_Files.package_name =
                       Package_Stats.package_name),
    tot_size = (SELECT SUM(size_bytes)
                  FROM Package_Files
                 WHERE Package_Files.package_name = 
                       Package_Stats.package_name),
    min_size = (SELECT MIN(size_bytes)
                  FROM Package_Files
                 WHERE Package_Files.package_name = 
                       Package_Stats.package_name),
    max_size = (SELECT MAX(size_bytes)
                  FROM Package_Files
                 WHERE Package_Files.package_name = 
                       Package_Stats.package_name),
    avg_size = (SELECT AVG(size_bytes)
                  FROM Package_Files
                 WHERE Package_Files.package_name = 
                       Package_Stats.package_name);

This file consists of four SQL statements. The first two are used to create the new table, as we have seen before. The third statement is an alternate form of the INSERT statement. This form is useful, as it copies a value from one table into another. This INSERT will create all the rows we need but only fill in the package_name column. To fill in the rest, we will use an UPDATE that fills in the remaining five values based on the results of some queries of the Package_Files table. Note that without a WHERE clause, UPDATE applies changes to every row.

Once the table is constructed, we can examine its contents.

sqlite> .width 25 -5 -10 -8 -8 -10
sqlite> SELECT * FROM Package_Stats LIMIT 10;
package_name               count    tot_size  min_size  max_size    avg_size
-------------------------  -----  ----------  --------  --------  ----------
a2ps                         299     3455890       117    388096  11558.1605
accountsservice               19      261704        42    182552  13773.8947
acl                           11       91106        30     35512  8282.36363
acpi-support                  18       13896        67      4922       772.0
acpid                         19       86126       115     52064  4532.94736
adduser                       81      246658         7     37322  3045.16049
adium-theme-ubuntu           137      126759        25     12502  925.248175
adwaita-icon-theme          1127     4848678        30     87850  4302.28748
aisleriot                    316     1890864        47    281544  5983.74683
alsa-base                     42      195295        24     34160  4649.88095

We’ll come back to this table a little later when we take a look at joins.

Deleting Rows

Deleting rows is pretty easy in SQL. There is a DELETE statement with a WHERE clause to specify a target. We’ll demonstrate that, but first there’s a nifty trick that SQLite supports.

We can change the output mode to write out INSERT statements. Let’s try it out.

sqlite> .mode insert Package_Files

If we use this .mode setting, we tell SQLite that we want INSERT statements directed at the specified table, in this case, Package_Files. Once we set this output mode, we can see the result.

sqlite> SELECT * FROM Package_Files WHERE package_name = 'sqlite3';
INSERT INTO Package_Files VALUES('sqlite3','/usr/bin/sqldiff',1103380);
INSERT INTO Package_Files VALUES('sqlite3','/usr/bin/sqlite3',1261076);
INSERT INTO Package_Files VALUES('sqlite3','/usr/share/doc/sqlite3/copyright',1
361);
INSERT INTO Package_Files VALUES('sqlite3','/usr/share/man/man1/sqlite3.1.gz',3
696);
INSERT INTO Package_Files VALUES('sqlite3','/usr/share/doc/sqlite3/changelog.De
bian.gz',135);

We’ll repeat this SELECT, but first we’ll change the output from standard output to a file named insert_sqlite3.sql.

sqlite> .output insert_sqlite3.sql
sqlite> select * from Package_Files where package_name = 'sqlite3';

This will write the stream of INSERT statements to the specified file. Next we’ll set the output back to standard output by issuing the .output dot command without an output file name.

sqlite> .output

Now let’s delete the rows in the Package_Files table.

sqlite> DELETE FROM Package_Files WHERE package_name = 'sqlite3';

We can confirm the deletion by running our query again and we see an empty result.

sqlite> .header on
sqlite> .mode column
sqlite> .width 12 50 -10
sqlite> SELECT * FROM Package_Files WHERE package_name = 'sqlite3';
sqlite>

Since we saved an SQL stream that can restore the deleted rows, we can now put them back in the table. The .read dot command can read the stream and execute it as though it came from standard input.

sqlite> .read insert_sqlite3.sql

Now when we run our query, we see that the rows have been restored.

sqlite> SELECT * FROM Package_Files WHERE package_name = 'sqlite3';
package_name  file                                                size_bytes
------------  --------------------------------------------------  ----------
sqlite3       /usr/bin/sqldiff                                       1103280
sqlite3       /usr/bin/sqlite3                                       1260976
sqlite3       /usr/share/doc/sqlite3/copyright                          1261
sqlite3       /usr/share/man/man1/sqlite3.1.gz                          3596
sqlite3       /usr/share/doc/sqlite3/changelog.Debian.gz                  35

Adding and Deleting Columns

SQL provides the ALTER TABLE statement to modify table’s schema. To demonstrate this, we will add a couple of columns to the Package_Descriptions table and fill them with values calculated from the Package_Files table. We’ll place the necessary SQL in the add_column.sql file.

-- add_column.sql

-- Add and populate columns to Package_Descriptions

ALTER TABLE Package_Descriptions ADD COLUMN files INTEGER;
ALTER TABLE Package_Descriptions ADD COLUMN size INTEGER;

UPDATE Package_Descriptions
   SET files = (SELECT COUNT(file)
                  FROM Package_Files
                 WHERE Package_Files.package_name =
                       Package_Descriptions.package_name),
        size = (SELECT SUM(size_bytes)
                  FROM Package_Files
                 WHERE Package_Files.package_name =
                       Package_Descriptions.package_name);

We’ll execute the statements and examine resulting schema.

sqlite> .read add_column.sql
sqlite> .schema
CREATE TABLE Package_Descriptions (
    package_name VARCHAR(60),
    description  VARCHAR(120),
    files        INTEGER,
    size         INTEGER);

sqlite> SELECT * FROM Package_Descriptions WHERE package_name = 'sqlite3';
package_name  description                          files       size      
------------  -----------------------------------  ----------  ----------
sqlite3       Command line interface for SQLite 3  5           2369648

SQL provides another ALTER TABLE statement for deleting columns from a table. It has the following form:

ALTER TABLE table_name DROP COLUMN column_name;

Unfortunately, SQLite does not support this so we have to do it the hard way. This is accomplished in four steps:

Create a new temporary table with the desired schema.
Copy the columns we want to keep into the temporary table.
Drop the original table.
Rename the temporary table.

Here is a file called drop_column.sql that does the job.

-- drop_column.sql

-- Remove extra columns from Package_Descriptions

BEGIN TRANSACTION;

-- Create new table with temporary name

DROP TABLE IF EXISTS temp;
CREATE TABLE temp (
    package_name VARCHAR(60),
    description  VARCHAR(120));

-- Copy columns we want into new table

INSERT INTO temp
    (package_name, description)
    SELECT package_name, description
      FROM Package_Descriptions;

-- Get rid of old table and rename the new replacement table

DROP TABLE Package_Descriptions;

ALTER TABLE temp RENAME TO Package_Descriptions;

COMMIT;

We again use the alternate form of the INSERT statement for copying values from one table to another. We copy the package_name and description columns from the Package_Descriptions table to the correspondingly named columns in the temp table.

Joins

A join is a method by which we perform a query and produce a result that combines the data from two tables. SQLite supports several types of joins but we’re going to focus on the most commonly used type called an inner join. We can think of an inner join as the intersection of two tables. In the example below, a file called join_demo.sql, we will produce a query result that matches our earlier example when we added columns to the Package_Descriptions table.

-- join_demo.sql

-- Demonstrate join by selecting columns from 2 tables

.header on
.mode column
.width 20 35 -5 -10

SELECT Package_Descriptions.package_name AS Package,
       description AS Description,
       count AS Files,
       tot_size AS Size
  FROM Package_Descriptions
  INNER JOIN Package_Stats
          ON Package_Descriptions.package_name =
             Package_Stats.package_name
LIMIT 10;

The results of this query are as follows:

Package               Description                          Files        Size
--------------------  -----------------------------------  -----  ----------
a2ps                  GNU a2ps - 'Anything to PostScript'    299     3455890
accountsservice       query and manipulate user account i     19      261704
acl                   Access control list utilities           11       91106
acpi-support          scripts for handling many ACPI even     18       13896
acpid                 Advanced Configuration and Power In     19       86126
adduser               add and remove users and groups         81      246658
adium-theme-ubuntu    Adium message style for Ubuntu         137      126759
adwaita-icon-theme    default icon theme of GNOME (small    1127     4848678
aisleriot             GNOME solitaire card game collectio    316     1890864
alsa-base             ALSA driver configuration files         42      195295

If we break down this query, we see that it starts out as we expect, then it is followed by the INNER JOIN statement. The INNER keyword is optional as an inner join is the default. After the INNER JOIN we have to specify the relationship on which the join is based. In this example, we are looking for matching instances of the package_name in both tables. Any expression my be used to convey the table relationship, provided that the result of the expression is true or false.

Since package_name is a column in both the Package_Descriptions and Package_Stats tables, we must precede it with the name of the respective table to avoid ambiguity.

Views

The join example above is a pretty useful query for our tables, but due to its complexity it’s best executed from a file rather than as an ad hoc query. SQL addresses this issue by providing a feature called views that allows a complex query to be stored in the database and used to produce a virtual table that can be used with simple query commands. In the following example we will create a view using our INNER JOIN query above to create a virtual table called Stats that we can use as the target of subsequent queries.

-- view_demo.sql

DROP VIEW IF EXISTS Stats;
CREATE VIEW Stats
AS
    SELECT Package_Descriptions.package_name AS Package,
           description AS Description,
           count AS Files,
           tot_size AS Size
      FROM Package_Descriptions
      INNER JOIN Package_Stats
              ON Package_Descriptions.package_name =
                 Package_Stats.package_name
        ORDER BY Package;

Once our view is created, we can treat Stats as just another table in our database even though it does not really exist as such.

sqlite> .header on
sqlite> .mode column
sqlite> .width 20 35 -5 -10
sqlite> SELECT * FROM Stats LIMIT 10;
Package               Description                          Files        Size      
--------------------  -----------------------------------  -----  ----------
a2ps                  GNU a2ps - 'Anything to PostScript'    299     3455890   
accountsservice       query and manipulate user account i     19      261704    
acl                   Access control list utilities           11       91106     
acpi-support          scripts for handling many ACPI even     18       13896     
acpid                 Advanced Configuration and Power In     19       86126     
adduser               add and remove users and groups         81      246658    
adium-theme-ubuntu    Adium message style for Ubuntu         137      126759    
adwaita-icon-theme    default icon theme of GNOME (small    1127     4848678   
aisleriot             GNOME solitaire card game collectio    316     1890864   
alsa-base             ALSA driver configuration files         42      195295

To delete a view we use the DROP VIEW statement as follows:

sqlite> DROP VIEW Stats;

Indexes

It’s been said that the three most important features of a database system are “performance, performance, and performance.” While this in not exactly true (things like data integrity and reliability are important, too), complex operations on large databases can get really slow, so it’s important to make things as fast as we can. One feature we can take advantage of are indexes.

An index is a data structure the database maintains that speeds up database searches. It’s a sorted list of rows in a table ordered by elements in one or more columns. Without an index, a table is sorted according to values in a hidden column called rowid. The values in this column are integers that start with 1 and increment each time a row is added to the table. Here we see a query that selects the 100th row from the Package_Stats table.

sqlite> .header on
sqlite> .mode column
sqlite> .width 20 -5 -8 -8 -8 -8
sqlite> SELECT * FROM Package_Stats WHERE rowid = 100;
package_name          count  tot_size  min_size  max_size  avg_size
--------------------  -----  --------  --------  --------  --------
cups-server-common      595   1996400         0    370070  3355.294

Our database server can locate this row in the table very quickly because it already knows where to find the 100th row. However, if we want to search for the row that contains package name cups-server-common, SQLite must examine every row in the table to locate the matching row. To facilitate performance tuning, SQLite provides a way to see what search strategy is used during a query.

sqlite> .width -8 -5 -4 55 
sqlite> EXPLAIN QUERY PLAN
   ...> SELECT * FROM Package_Stats WHERE package_name = 'cups-server-common';
selectid  order  from  detail
--------  -----  ----  ------------------------------------------------------- 
       0      0     0  SCAN TABLE Package_Stats

We can see from the SCAN TABLE Package_Stats the SQLite performs a sequential search of the table during this query.

To create an index to allow faster searches of package_name we can do the following:

sqlite> CREATE INDEX idx_package_name
   ...> ON Package_Stats (package_name);

After doing this, we’ll look at the query plan and see the difference.

sqlite> EXPLAIN QUERY PLAN
   ...> SELECT * FROM Package_Stats WHERE package_name = 'cups-server-common';
   selectid  order  from  detail                                                 
   --------  -----  ----  -------------------------------------------------------
          0      0     0  SEARCH TABLE Package_Stats USING INDEX idx_package_name

Hereafter, when we search the table for a package name, SQLite will use the index to directly get to the row rather than looking at every row searching for a match. So why don’t we just index everything? The reason we don’t is that indexes impose overhead every time a row is inserted, deleted, or updated since the indexes must be kept up to date. Indexes are best used on tables that are read more often than written to.

We probably won’t see much of a performance improvement when searching the Package_Stats table because it’s just not that big, but on a large table the improvement can be substantial.

We can see the index when we examine the table’s schema.

sqlite> .schema Package_Stats
CREATE TABLE Package_Stats (
    package_name  VARCHAR(60),
    count         INTEGER,
    tot_size      INTEGER,
    min_size      INTEGER,
    max_size      INTEGER,
    avg_size      REAL
);
CREATE INDEX idx_package_name
ON Package_Stats (package_name);

SQLite also has a dot command.

sqlite> .indexes
idx_package_name

Another benefit of using an index is that it’s kept in sorted order (that’s how it performs searches quickly). The side effect is that when an index is used during a query the results of the query will be sorted as well. To demonstrate, we’ll create another index for the Package_Stats table, this time using the tot_size column. Notice that when we perform a select based on that column, the results are in ascending order.

sqlite> CREATE INDEX idx_tot_size
   ...> ON Package_Stats (tot_size);
sqlite> .width 20 -5 -10 -8 -8 -8
sqlite> SELECT * FROM Package_Stats
   ...> WHERE tot_size > 100000000;
   package_name          count    tot_size  min_size  max_size avg_size
   --------------------  -----  ----------  --------  -------- --------
   inkscape               1025   127507308         0  19599216 124397.3
   libreoffice-core        119   135106135        26  66158968 1135345.
   linux-modules-extra-   4172   165884678      1292   4216105 39761.42
   linux-modules-extra-   4173   165921470      1292   4216105 39760.71
   thunderbird              69   180861838        27  12163098 2621186.
   firefox                  74   203393773        23  12408360 2748564.
   google-chrome-stable    100   235727530        25  16288078 2357275.
   libgl1-mesa-dri:amd6     20   237774005        36  19548840 11888700
   codium                 1307   271907088        17  11551467 208039.0
   linux-firmware         1938   331497257         6  19922416 171051.2

To delete our indexes, we use the DROP INDEX statement.

sqlite> DROP INDEX idx_package_name;
sqlite> DROP INDEX idx_tot_size;

Triggers and Stored Procedures

As we saw earlier during our discussion of views, SQL allow us to store SQL code in the database. Besides views, SQL provides for two other ways of storing code. These two methods are stored procedures and triggers. Stored procedures, as the name implies, allows a block of SQL statements to be stored and treated as a subroutine available to other SQL programs, or for use during ad hoc interactions with the database. Creating a stored procedure is done with this syntax:

CREATE PROCEDURE procudure_name
AS
    [block of SQL code];

Parameters can be passed to stored procedures. Here is an example:

CREATE PROCEDURE list_pkg_files @package VARCHAR(60)
AS
    SELECT package_name, file
      FROM Package_Files
     WHERE package_name = @package;

To call this procedure, we would do this:

EXEC list_package_files @package = 'bash';

Unfortunately, SQLite does not support stored procedures. It does, however, support the second method of code storage, triggers.

Triggers are stored blocks of code that are automatically called when some event occurs and a specified condition is met. Triggers are typically used to perform certain maintenance tasks to keep the database in good working order.

Triggers can be set to activate before, after, or instead of the execution of INSERT, DELETE, or UPDATE statements. In the example below, we will have a trigger activate before a DELETE is performed on the Package_Files table.

/*
    trigger_demo.sql

    Trigger demo where we create a "trash can" for the
    Package_Files table and set a trigger to copy rows
    to the PF_Backup table just before they are deleted
    from Package_Files.
*/

-- Create backup table with the same schema as Package_Files

CREATE TABLE IF NOT EXISTS PF_Backup (
    package_name VARCHAR(60),
    file         VARCHAR(120),
    size_bytes   INTEGER
);

-- Define trigger to copy rows into PF_Backup as they are
-- deleted from Package_Files

CREATE TRIGGER backup_row_before_delete
BEFORE DELETE ON Package_Files
BEGIN
    INSERT INTO PF_Backup
    VALUES (OLD.package_name, OLD.file, OLD.size_bytes);
END;

The first thing we do is create a table to hold our deleted rows. We use a slightly different form of the CREATE TABLE statement to create the table only if it does not already exist. This will ensure that an existing table of saved rows will persist, even if we reload the trigger.

After creating the table, we create a trigger called backup_row_before_delete to copy data from the Package_Files table to the PF_Backup table just before any row in Package_Files is deleted.

In order to reference data that might be used by the trigger, SQL provides the NEW reference for new data that is inserted or updated, and the OLD reference for previous data that is updated or deleted. In our example, we use the OLD reference to refer to the data about to be deleted.

Performing Backups

Since SQLite uses an ordinary file to store each database (as opposed to the exotic methods used by some other systems), we can use regular command line tools such as cp to perform database backups. There is an interesting SQL method we can use, too. The .dump dot command will produce a stream of SQL statements that will fully reproduce the database including tables, views, triggers, etc. To output the database this way, we need only do the following:

sqlite> .dump

The stream will appear on standard output or we can use the .output dot command to direct the stream to the file of our choice.

One interesting application of this technique would be to combine tables from multiple databases into one. For example, let’s imagine we had several Raspberry Pi computers each performing data logging of an external sensor. We could collect dumps from each machine and combine all of the tables into a single database for data analysis and reporting.

Generating Your Own Datasets

Below are the programs used to create the datasets used in this adventure. They are included in the archive for those who want to create their own datasets.

For Deb-based Systems (Debian, Ubuntu, Mint, Raspberry Pi OS)

The first program named mk_package_descriptions-deb, extracts package information and outputs a .tsv file.

#!/bin/bash

# mk_package_descriptions-deb - Output tsv list of installed debian/ubnutu
#                               packages on standard output

phase1() { # Replace repeated spaces with a tab
  awk '
  {
    gsub(/[ ][ ]+/, "\t")
    print $0
  }'
  return
}

phase2() { # Output field 2 and 5 separated by a tab
  awk '
    BEGIN {
      FS = "\t"
    }

    $1 == "ii" {
      print $2 "\t" $5
    }'
  return
}

dpkg-query -l | phase1 | phase2

The second program, mk_package_files-deb outputs all the files included in each package.

#!/bin/bash

# mk_package_files - make list of files in all packages
# Reads *.list files in LIST_DIR. Outputs stream of tsv to stdout.

LIST_DIR=/var/lib/dpkg/info

mk_list () {

  local list_file="$1"
  local lf_length="${#list_file}"
  local len
  local package
  local -a files

  ((len = lf_length - 5))
  package="${list_file:0:$len}" # strip '.list' extension
  package="${package##*/}" # strip leading pathname
  mapfile files < "$list_file" # load list into array
  for i in "${files[@]}"; do
    i="${i//[$'\t\r\n']}" # strip trailing newlines
    if [[ -f "$i" ]] ; then # write tsv file
      printf "%s\t%s\t%s\n" \
        "$package" \
        "$i" \
        "$(stat --printf '%s' "$i")" # size of file
    fi
  done
  return
}

for i in "$LIST_DIR"/*.list; do
  mk_list "$i"
done

For RPM-based Systems (RHEL, CentOS, Fedora)

The mk_package_descriptions-rpm script:

#!/bin/bash

# mk_package_descriptions-rpm - Output tsv list of installed Fedora/CentOS
#                               packages on standard output

while read package; do
  description=$(dnf info "$package" \
    | awk '$1 == "Summary" { gsub(/Summary      : /, ""); print $0; exit }')
  printf "%s\t%s\n" \
    "$package" \
    "$description"
done < <( dnf list installed | awk 'NF == 3 { print $1 }' )

The mk_package_files-rpm script:

#!/bin/bash

# mk_package_files-rpm - Output tsv list of installed Fedora/CentOS files
#                        on standard output

while read package; do
  while read package_file; do
    if [[ -r "$package_file" ]]; then # not all files are present/readable
      printf "%s\t%s\t%s\n" \
        "$package" \
        "$package_file" \
        "$(stat --printf '%s' "$package_file")"
    fi
  done < <( rpm -ql "$package" )
done < <( dnf list installed | awk 'NF == 3 { print $1 }' )

Converting .tsv to SQL

Below are two AWK programs used to convert the .tsv files into SQL. First, the insert_package_descriptions.awk program:

#!/usr/bin/awk -f

# insert_package_descriptions.awk - Insert records from
#                                   package_descriptions.tsv

BEGIN {
    FS="\t"
    print "DROP TABLE IF EXISTS Package_Descriptions;"
    print "CREATE TABLE Package_Descriptions ("
    print "    package_name VARCHAR(60),"
    print "    description  VARCHAR(120)"
    print ");"
    print "BEGIN TRANSACTION;" # vastly improves performance
}

{
    gsub(/'/, "''") # double up single quotes to escape them
    print "INSERT INTO Package_Descriptions"
    print "     VALUES ( '" $1 "', '" $2 "');"
}

END {
    print "COMMIT;"
}

Second, the insert_package_files.awk program:

#!/usr/bin/awk -f

# insert_package_files.awk - Insert records from
#                            package_files.tsv

BEGIN {
    FS="\t"
    print "DROP TABLE IF EXISTS Package_Files;"
    print "CREATE TABLE Package_Files ("
    print "    package_name VARCHAR(60),"
    print "    file         VARCHAR(120),"
    print "    size_bytes   INTEGER"
    print ");"
    print "BEGIN TRANSACTION;" # vastly improves performance
}

{
    gsub(/'/, "''") # double up single quotes to escape them
    print "INSERT INTO Package_Files"
    print "     VALUES ('" $1 "', '" $2 "','" $3 "');"
}

END {
    print "COMMIT;"
}

Summing Up

SQL is an important and widely used technology. It’s kinda fun too. While we looked at the general features and characteristics of SQL, there is much more to learn. For example, there are the more advanced concepts such as normalization, referential integrity, and relational algebra. Though we didn’t get to the really heady stuff, we did cover enough to get some real work done whenever we need to integrate data storage into our scripts and projects.