18 minute read
https://linuxcommand.org/lc3_adv_sql.php
Okay kids, gird your grid for a big one.
The world as we know it is powered by data. Data, in turn, is stored in databases. Most everything we use computers for involves using data stored in some kind of database. When we talk about storing large amounts of data, we often mean relational database management systems (RDBMS). Banks, insurance companies, most every accounting system, and many, many websites use relational databases to store and organize their data.
The idea of relational data storage is generally credited to English computer scientist and IBM researcher E. F. Codd, who proposed it in a paper in 1970. In the years that followed, a number of software companies built relational database systems with varying degrees of success. Around 1973, IBM developed a simplified and flexible method of managing relational databases called Structured Query Language (SQL, often pronounced “sequel”). Today the combination of RDBMS and SQL is a huge industry, generating many billions of dollars every year.
Relational databases are important to the history of Linux as well. It was the availability of open source database programs (such as MySQL) and web servers (most notably, Apache) that led to an explosion of Linux adoption in the early days of the world wide web.
In this adventure, we’re going to study the fundamentals of relational databases and use a popular command line program to create and manage a database of our own. The AWK adventure is a suggested prerequisite.
Before we can delve into SQL we have to look at what relational databases are and how they work.
Simply put, a relational database is one or more tables containing columns and rows. Technically, the columns are known as attributes and the rows as tuples, but most often they are simply called columns and rows. In many ways, a table resembles the familiar spreadsheet. In fact, spreadsheet programs are often used to prepare and edit database tables. Each column contains data of a consistent type, for example, one column might consist of integers and another column strings. A table can contain any number of rows.
The design of a database is called its schema and it can be simple, containing just a single table or two, or it can be complex, containing many tables with complex interrelationships between them.
Let’s imagine a database for a bookstore consisting of three tables. The first is called Customers, the second is called Items, and the third is called Sales. The Customers table will have multiple rows, with each row containing information about one customer. The columns include a customer number, first and last names, and the customer’s address. Here is such a table with just some made-up names:
Cust First Last Street City ST
---- ------- -------- --------------------- ----------- --
0001 Richard Stollman 1 Outonthe Street Boston MA
0002 Eric Roymond 2 Amendment Road Bunker Hill PA
0003 Bruce Porens 420 Middleville Drive Anytown US
The Items table lists our books and contains the item number, title, author, and price.
Item Title Author Price
---- -------------------------------------- ------------- -----
1001 Winning Friends and Influencing People Dale Carnegie 14.95
1002 The Communist Manifesto Marx & Engels 00.00
1003 Atlas Shrugged Ayn Rand 99.99
As we go about selling items in our imaginary bookstore, we generate rows in the Sales table. Each sale generates a row containing the customer number, date and time of the sale, the item number, the quantity sold, and the total amount of the sale.
Cust Date_Time Item Quan Total
---- ------------ ---- ---- ------
0002 202006150931 1003 1 99.99
0001 202006151108 1002 1 0.00
0003 202006151820 1001 10 149.50
Now we might be wondering what the Cust and Item columns are for. They serve as keys. Keys are values that uniquely identify a table row and facilitate interrelationships between tables. Keys have some special properties: they must be unique (that is, they can appear only once in a table and specifically identify a row) and they must be immutable (they can never change). If a value can’t meet these requirements, it cannot be a key. Some database implementations can enforce these requirements during table creation, where keys can be formally specified. In the case of our bookstore database, the Cust column contains the keys for the Customers table and the Item column contains the keys for the Items table.
Knowing this about keys, we can now understand why the Sales table works the way it does. We can see, for example, that row 1 of the Sales table tells us that customer 0002 purchased 1 copy of item 1003 for $99.99. So why do we need special values for the keys? Why not, for instance, just use the customer’s name as the key? Because we can’t guarantee that the name won’t change, or that two people won’t have the same name. We can guarantee that an arbitrarily assigned value like our customer number is unique and immutable.
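Some databases let us declare the key when a table is created, so uniqueness is enforced automatically. A minimal sketch using SQLite’s command line tool (column names and widths are assumptions for illustration):

```shell
# Sketch: declaring the Customers table with cust as its primary key.
# Run against a transient in-memory database.
result=$(sqlite3 <<'EOF'
CREATE TABLE Customers (
    cust   CHAR(4) PRIMARY KEY,  -- the key: unique and never changed
    first  VARCHAR(20),
    last   VARCHAR(20)
);
INSERT INTO Customers VALUES ('0001', 'Richard', 'Stollman');
INSERT INTO Customers VALUES ('0002', 'Eric', 'Roymond');
SELECT count(*) FROM Customers;
EOF
)
echo "$result"
```

Attempting to insert a second row with cust = '0001' would fail with a constraint violation, which is exactly the uniqueness guarantee a key provides.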
There are a number of database servers available for Linux. The two most prominent are MySQL (and its fork MariaDB) and PostgreSQL. These database servers implement client/server architecture, a concept that became a hot topic in the 1990s. Database servers of this type run as server processes and clients connect to them over a network connection and send SQL statements for the server to carry out. The results of the database operations are returned to the client program for further processing and presentation. Many web sites use this architecture with a web server sending SQL statements to a separate database server to dynamically create web pages as needed. The famous LAMP stack consisting of Linux, Apache web server, MySQL, and PHP powered much of the early web.
For purposes of this adventure, setting up database servers such as MySQL and PostgreSQL is too complicated to cover here since, among other things, they support multiple concurrent users and their attendant security controls. It’s more than we need for just learning SQL.
The database server we will be using is SQLite. SQLite is a library that can be used with applications to provide a high level of SQL functionality. It’s very popular with the embedded systems crowd. It saves application developers the trouble of writing custom solutions to their data storage and management tasks. In addition to the library, SQLite provides a command line tool for directly interacting with SQLite databases. Also, since it accepts SQL from standard input (as well as its own command line interface) and sends its results to standard output, it’s ideal for use in our shell scripts.
SQLite is available from most Linux distribution repositories. Installation is easy, for example:
me@linuxbox:~$ sudo apt install sqlite3
Let’s build a playground and play with some real data. On the LinuxCommand.org site there is an archive we can download that will do the trick.
me@linuxbox:~$ cd
me@linuxbox:~$ curl -O http://linuxcommand.org/adventure-sql.tgz
me@linuxbox:~$ tar xzf adventure-sql.tgz
me@linuxbox:~$ cd adventure-sql
Extracting the .tgz file will produce the playground directory containing the data sets, some demonstration code, and some helpful tools. The data sets we will be working with contain the listings of installed packages on an Ubuntu 18.04 system. This will include the name of packages, a short description of each one, a list of files contained in each package, and their sizes.
me@linuxbox:~/adventure-sql$ ls
All the files are human-readable text, so feel free to give them a look. The data sets in the archive are the .tsv files. These are tab-separated value files. The first one is the package_descriptions.tsv file. This contains two columns of data: a package name and a package description. The second file, named package_files.tsv, has three columns: a package name, the name of a file installed by the package, and the size of the file.
To launch SQLite, we simply issue the command sqlite3, optionally followed by the name of a file to hold our database of tables. If we do not supply a file name, SQLite will create a temporary database in memory.
me@linuxbox:~/adventure-sql$ sqlite3
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>
When loading is complete, SQLite will present a prompt where we can enter commands. Commands can be either SQL statements or dot commands that are used to control SQLite itself. To see a list of the available dot commands, we enter .help at the prompt.
sqlite> .help
There are only a few of the dot commands that will be of interest to us, and they deal mainly with how output is formatted. To exit sqlite3, we enter the dot command .quit at the prompt.
sqlite> .quit
Though we can interact directly with the sqlite3 program using the sqlite> prompt, sqlite3 can also accept streams of dot commands and SQL statements through standard input. This is how SQLite is most often used.
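For instance, we can pipe a statement straight into sqlite3; with no file argument it runs against a transient in-memory database:

```shell
# Pipe SQL to sqlite3 through standard input; the result comes
# back on standard output, ready for use in a pipeline.
result=$(echo 'SELECT 1 + 1;' | sqlite3)
echo "$result"
```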
To get started with our database, we need to first convert our .tsv files into a stream of SQL statements. Our database will initially consist of two tables. The first is named Package_Descriptions and the second is named Package_Files. To create the SQL stream for the Package_Descriptions table we will use the insert_Package_Descriptions.awk program supplied in the playground archive.
me@linuxbox:~/adventure-sql$ ./insert_Package_Descriptions.awk \
< package_descriptions.tsv > insert_Package_Descriptions.sql
Let’s take a look at the resulting SQL stream. We’ll use the head command to display the first few lines of the stream.
me@linuxbox:~/adventure-sql$ head insert_Package_Descriptions.sql
DROP TABLE IF EXISTS Package_Descriptions;
CREATE TABLE Package_Descriptions (
package_name VARCHAR(60),
description VARCHAR(120)
);
BEGIN TRANSACTION;
INSERT INTO Package_Descriptions
VALUES ( 'a2ps',
'GNU a2ps - ''Anything to PostScript'' converter and pretty-printer');
INSERT INTO Package_Descriptions
VALUES ( 'accountsservice',
'query and manipulate user account information');
And the last few lines using the tail command.
me@linuxbox:~/adventure-sql$ tail insert_Package_Descriptions.sql
VALUES ( 'zlib1g:amd64',
'compression library - runtime');
INSERT INTO Package_Descriptions
VALUES ( 'zlib1g:i386',
'compression library - runtime');
INSERT INTO Package_Descriptions
VALUES ( 'zlib1g-dev:amd64',
'compression library - development');
INSERT INTO Package_Descriptions
VALUES ( 'zsh',
'shell with lots of features');
INSERT INTO Package_Descriptions
VALUES ( 'zsh-common',
'architecture independent files for Zsh');
COMMIT;
As we can see, SQL is verbose and somewhat English-like. Convention dictates that language keywords be written in uppercase; however, this is not required, as SQL is case insensitive. White space is not significant but is often used to make SQL statements easier to read. Statements can span multiple lines but don’t have to, and each statement is terminated by a semicolon. The SQL in this adventure is generally formatted in accordance with the style guide written by Simon Holywell linked in the “Further Reading” section below. Since some SQL can get quite complicated, visual neatness counts when writing code.
SQL supports two forms of commenting.
-- Single line comments are preceded by two dashes
/* And multi-line comments are done in the
style of the C programming language */
Before we go on, we need to digress for a moment to discuss SQL as a standard. While there are ANSI standards for SQL, every database server implements SQL differently. Each one has a slightly different dialect. The reason for this is partly historical; in the early days there weren’t any firm standards, and partly commercial. Each database vendor wanted to make it hard for customers to migrate to competing products so each vendor added unique extensions and features to the language to promote the dreaded “vendor lock-in” for their product. SQLite supports most of standard SQL (but not all of it) and it adds a few unique features.
The first two lines of our SQL stream delete any existing table named Package_Descriptions and create a new table with that name. The DROP TABLE statement deletes a table. The optional IF EXISTS clause prevents errors if the table does not already exist. (There are a lot of optional clauses in SQL.) The CREATE TABLE statement defines a new table. As we can see, this table will have two columns. The first column, package_name, is defined to be a variable length string up to 60 characters long. VARCHAR is one of the available data types we can specify. Here are some of the common data types supported by SQL databases:
| Data Type | Description |
|-----------|-------------|
| INTEGER | Integer |
| CHAR(n) | Fixed length string |
| VARCHAR(n) | Variable length string |
| NUMERIC | Decimal numbers |
| REAL | Floating point numbers |
| DATETIME | Date and time values |
SQL databases support many types of data. Unfortunately, this varies by vendor. Even in cases where two databases share the same data type name, the actual meaning of the data type can differ. Data types in SQLite, unlike other SQL databases, are not rigidly fixed. Values in SQLite are dynamically typed. While SQLite allows many of the common data types found in other databases to be specified, it actually only supports 4 general types of data storage.
| Data Type | Description |
|-----------|-------------|
| INTEGER | Signed integers using 1, 2, 3, 4, 6, or 8 bytes as needed |
| REAL | 8-byte IEEE floating point numbers |
| TEXT | Text strings |
| BLOB | Binary large objects (for example, JPEG or MP3 files) |
In our example above, we specified VARCHAR as the data type for our columns. SQLite is perfectly happy with this, but it actually stores the values in these columns as just TEXT, ignoring the length restriction in the data type specification. SQLite is extremely lenient about data types. In fact, it allows any kind of data to be stored in any specified data type, even a mixture of data types in a single column. This is completely incompatible with other databases, and relying on it would be very bad practice. In the remainder of this adventure we will stick to conventional data types and behavior.
Moving on with our SQL stream, we see that the majority of the stream consists of INSERT statements. This is how rows are added to a table. Insert is something of a misnomer, as INSERT statements actually append rows to a table.
We surround the INSERT statements with BEGIN TRANSACTION and COMMIT. This is done for performance reasons. If we leave these out, the rows will still be appended to the table, but each INSERT will be treated as a separate transaction, vastly increasing the time it takes to append a large number of rows. Treating the inserts as a single transaction has another important benefit: SQL does not apply a transaction to the database until it receives the COMMIT statement, so it is possible to write SQL code that abandons a transaction if there is a problem, rolling back the changes and leaving the database unchanged.
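A tiny sketch of that rollback behavior, using a throwaway in-memory table:

```shell
# If a transaction is rolled back instead of committed,
# the database is left unchanged.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
BEGIN TRANSACTION;
INSERT INTO t VALUES ('a');
ROLLBACK;  -- abandon the transaction; the INSERT never takes effect
SELECT count(*) FROM t;
EOF
)
echo "$result"
```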
Let’s go ahead and create our first table and add the package names and descriptions.
me@linuxbox:~/adventure-sql$ sqlite3 adv-sql.sqlite \
< insert_Package_Descriptions.sql
We execute the sqlite3 program specifying adv-sql.sqlite as the file used to store our tables. The choice of file name is arbitrary. We read our SQL stream into standard input and sqlite3 carries out the statements.
Now that we have a database (albeit a small one), let’s take a look at it. To do this, we will start up sqlite3 and interact with it at the prompt.
me@linuxbox:~/adventure-sql$ sqlite3 adv-sql.sqlite
SQLite version 3.22.0 2018-01-22 18:45:57
Enter ".help" for usage hints.
sqlite>
We’ll first use some SQLite dot commands to examine the structure of the database.
sqlite> .tables
Package_Descriptions
sqlite> .schema Package_Descriptions
CREATE TABLE Package_Descriptions (
package_name VARCHAR(60),
description VARCHAR(120)
);
The .tables dot command displays a list of tables in the database, while the .schema dot command lists the statements used to create the specified table.
Next, we’ll get into some real SQL using SELECT, probably the most frequently used SQL statement.
sqlite> SELECT * FROM Package_Descriptions;
a2ps|GNU a2ps - 'Anything to PostScript' converter and pretty-printer
accountsservice|query and manipulate user account information
acl|Access control list utilities
acpi-support|scripts for handling many ACPI events
acpid|Advanced Configuration and Power Interface event daemon
adduser|add and remove users and groups
adium-theme-ubuntu|Adium message style for Ubuntu
adwaita-icon-theme|default icon theme of GNOME (small subset)
aisleriot|GNOME solitaire card game collection
alsa-base|ALSA driver configuration files
.
.
.
This is the simplest form of the SELECT statement. The syntax is the word SELECT followed by a list of columns or calculated values we want, and a FROM clause specifying the source of the data. This example uses *, which means every column. Alternately, we could explicitly name the columns, like so:
sqlite> SELECT package_name, description FROM Package_Descriptions;
And achieve the same result.
As we can see from the output above, the default format is fine for further processing by tools such as awk, but it leaves a lot to be desired when it comes to being read by humans. We can adjust the output format with some dot commands. We’ll also add the LIMIT clause to the end of our query to output just 10 rows.
sqlite> .headers on
sqlite> .mode column
sqlite> SELECT * FROM Package_Descriptions LIMIT 10;
package_name description
------------ ----------------------------------------------------------------
a2ps GNU a2ps - 'Anything to PostScript' converter and pretty-printer
accountsserv query and manipulate user account information
acl Access control list utilities
acpi-support scripts for handling many ACPI events
acpid Advanced Configuration and Power Interface event daemon
adduser add and remove users and groups
adium-theme- Adium message style for Ubuntu
adwaita-icon default icon theme of GNOME (small subset)
aisleriot GNOME solitaire card game collection
alsa-base ALSA driver configuration files
By using the .headers on and .mode column dot commands, we add the column headings and change the output to column format. These settings stay in effect until we change them. The .mode dot command has a number of interesting possible settings:
| Mode | Description |
|------|-------------|
| csv | Comma-separated values |
| column | Left-aligned columns. Use .width n1 n2… to set column widths. |
| html | HTML table code |
| insert | SQL INSERT statements |
| line | One value per line |
| list | Values delimited by the separator string |
| tabs | Tab-separated values |
Here we will set the mode and column widths for our table.
In addition to listing columns, SELECT can also output values calculated from them.
Being Selective

We can make our queries more selective by adding a WHERE clause that specifies which rows to return.
A partial match using SQL wildcard characters:
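Since the original example isn’t reproduced here, a self-contained sketch with invented rows; the % wildcard matches any run of characters:

```shell
# Partial matching with LIKE and the % wildcard (sample data is made up)
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Descriptions (package_name TEXT, description TEXT);
INSERT INTO Package_Descriptions VALUES ('zsh', 'shell with lots of features');
INSERT INTO Package_Descriptions VALUES ('zsh-common', 'architecture independent files for Zsh');
INSERT INTO Package_Descriptions VALUES ('acl', 'Access control list utilities');
SELECT package_name FROM Package_Descriptions
 WHERE package_name LIKE 'zsh%';
EOF
)
echo "$result"
```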
SQL supports two wildcard characters: the underscore (_), which matches any single character, and the percent sign (%), which matches any sequence of characters, including none. Notice too that strings are surrounded with single quotes. If a value is quoted this way, SQL treats it as a string; for example, the quoted value '123' is a string, while the unquoted 123 is a number.

Sorting Output

Unless we tell SELECT otherwise, the results of our query appear in the order the rows are stored in the table. To impose a different order, we can add an ORDER BY clause to the query.
The default sorting order is ascending, but we can also sort in descending order by including the DESC keyword in the ORDER BY clause.
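A minimal sketch of a descending sort, on a disposable table:

```shell
# ORDER BY sorts ascending by default; DESC reverses the order.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
INSERT INTO t VALUES ('acl'), ('zsh'), ('bash');
SELECT name FROM t ORDER BY name DESC;
EOF
)
echo "$result"
```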
Adding Another Table

To demonstrate more of what we can do with SELECT, we’ll add another table to our database.
The second table is named Package_Files.
As we can see, this table has 3 columns: the package name, the name of a file installed by the package, and the size of the installed file in bytes. Let’s do a quick query to take a look at it.
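A miniature stand-in for that query, with a few invented rows:

```shell
# A first look at a Package_Files-style table (rows are invented)
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Files (package_name TEXT, file TEXT, size_bytes INTEGER);
INSERT INTO Package_Files VALUES
    ('zsh', '/bin/zsh', 878288),
    ('zsh', '/usr/share/man/man1/zsh.1.gz', 9841),
    ('acl', '/bin/chacl', 31232);
SELECT * FROM Package_Files LIMIT 2;
EOF
)
echo "$result"
```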
Subqueries
As we saw before,
To make complicated expressions more readable, we can assign their results to aliases by using the AS keyword.
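A sketch of aliasing with AS, using made-up rows; the alias names files and total_size are arbitrary:

```shell
# Aliases give calculated columns readable names.
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Files (package_name TEXT, file TEXT, size_bytes INTEGER);
INSERT INTO Package_Files VALUES ('zsh', '/bin/zsh', 800), ('zsh', '/etc/zshrc', 200);
SELECT count(*)        AS files,
       sum(size_bytes) AS total_size
  FROM Package_Files;
EOF
)
echo "$result"
```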
An important feature of SELECT is the ability to nest one query inside another. The inner query is called a subquery.
We’ll next run this query and view the results.
The query takes some time to run (it has a lot to do), and from the results we see that it produces 4 columns: package name, description, number of files in the package, and total size of the package. Let’s take this query apart and see how it works. At the uppermost level, we see that the query follows the normal pattern of a SELECT statement.
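A compact, self-contained sketch of the same pattern: subqueries in the column list compute a per-package file count and total size (data invented):

```shell
# Correlated subqueries: for each package, count its files
# and sum their sizes.
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Descriptions (package_name TEXT, description TEXT);
CREATE TABLE Package_Files (package_name TEXT, file TEXT, size_bytes INTEGER);
INSERT INTO Package_Descriptions VALUES ('zsh', 'shell with lots of features');
INSERT INTO Package_Files VALUES ('zsh', '/bin/zsh', 800), ('zsh', '/etc/zshrc', 200);
SELECT package_name,
       (SELECT count(*)
          FROM Package_Files f
         WHERE f.package_name = d.package_name) AS files,
       (SELECT sum(size_bytes)
          FROM Package_Files f
         WHERE f.package_name = d.package_name) AS total_size
  FROM Package_Descriptions d;
EOF
)
echo "$result"
```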
The basic structure is simple. What’s interesting is the use of subqueries to compute some of the selected values.
It’s also possible to use a subquery in a WHERE clause.
When we execute this, we get the following results:
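A toy version of a subquery inside WHERE, with invented data:

```shell
# The subquery builds the set of packages that actually have files;
# the outer query keeps only the matching descriptions.
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Descriptions (package_name TEXT, description TEXT);
CREATE TABLE Package_Files (package_name TEXT, file TEXT);
INSERT INTO Package_Descriptions VALUES ('zsh', 'shell'), ('acl', 'ACL utilities');
INSERT INTO Package_Files VALUES ('zsh', '/bin/zsh');
SELECT package_name FROM Package_Descriptions
 WHERE package_name IN (SELECT package_name FROM Package_Files);
EOF
)
echo "$result"
```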
Updating Tables

The UPDATE statement modifies the values in existing table rows.
Next, we’ll update the table, adding 100 to the size of each file.
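On a throwaway table, the add-100 update looks like this:

```shell
# UPDATE rewrites matching rows in place; with no WHERE clause,
# every row in the table is affected.
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Files (package_name TEXT, file TEXT, size_bytes INTEGER);
INSERT INTO Package_Files VALUES ('zsh', '/bin/zsh', 800);
UPDATE Package_Files SET size_bytes = size_bytes + 100;
SELECT size_bytes FROM Package_Files;
EOF
)
echo "$result"
```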
When we examine the rows now, we see the change.
Finally, we’ll subtract 100 from each row to return the sizes to their original values.
This file consists of four SQL statements. The first two are used to create the new table, as we have seen before. The third statement is an alternate form of the INSERT statement.

Once the table is constructed, we can examine its contents.
We’ll come back to this table a little later when we take a look at joins.

Deleting Rows

Deleting rows is pretty easy in SQL. There is a DELETE statement for this purpose. Before deleting anything, we can change the output mode to write out rows as SQL INSERT statements, giving us a way to restore them later.
If we use this mode before deleting, we can save a copy of the rows we are about to remove.
We’ll repeat this
This will write the stream of INSERT statements to a file.
Now let’s delete the rows in the table.
We can confirm the deletion by running our query again and we see an empty result.
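A toy round trip of the idea: delete some rows, then confirm with a count:

```shell
# DELETE removes the rows matched by the WHERE clause.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
INSERT INTO t VALUES ('a'), ('b');
DELETE FROM t WHERE name = 'a';
SELECT count(*) FROM t;
EOF
)
echo "$result"
```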
Since we saved an SQL stream that can restore the deleted rows, we can now put them back in the table. The .read dot command executes the statements stored in a file.
Now when we run our query, we see that the rows have been restored.
Adding and Deleting Columns

SQL provides the ALTER TABLE statement for modifying a table’s schema; its ADD COLUMN form appends a new column to an existing table.
We’ll execute the statements and examine the resulting schema.
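A sketch of adding a column and checking the schema (table and column names invented):

```shell
# ALTER TABLE ... ADD COLUMN appends a new column to an existing table.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
ALTER TABLE t ADD COLUMN size INTEGER;
.schema t
EOF
)
echo "$result"
```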
SQL provides another form of ALTER TABLE for deleting columns from a table.
Unfortunately, SQLite does not support this, so we have to do it the hard way. This is accomplished in four steps:

1. Create a new table with the desired columns.
2. Copy the rows we want to keep into the new table.
3. Drop the old table.
4. Rename the new table to the original name.
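A condensed sketch of those four steps on a toy table:

```shell
# Dropping a column the hard way: rebuild the table without it.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT, unwanted INTEGER);
INSERT INTO t VALUES ('a', 1);
CREATE TABLE t_new (name TEXT);          -- 1. new table, desired columns only
INSERT INTO t_new SELECT name FROM t;    -- 2. copy the data we keep
DROP TABLE t;                            -- 3. drop the old table
ALTER TABLE t_new RENAME TO t;           -- 4. rename the new one
SELECT * FROM t;
EOF
)
echo "$result"
```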
Here is a file called
We again use the alternate form of the INSERT statement to copy the rows we want to keep into the new table.

Joins

A join is a method by which we perform a query and produce a result that combines the data from two tables. SQLite supports several types of joins, but we’re going to focus on the most commonly used type, called an inner join. We can think of an inner join as the intersection of two tables. In the example below, a file called
The results of this query are as follows:
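As the original output isn’t shown here, a self-contained inner join on invented rows:

```shell
# An inner join pairs rows from the two tables whose package names match.
result=$(sqlite3 <<'EOF'
CREATE TABLE Package_Descriptions (package_name TEXT, description TEXT);
CREATE TABLE Package_Files (package_name TEXT, file TEXT);
INSERT INTO Package_Descriptions VALUES ('zsh', 'shell with lots of features');
INSERT INTO Package_Files VALUES ('zsh', '/bin/zsh');
SELECT d.package_name, d.description, f.file
  FROM Package_Descriptions d
  JOIN Package_Files f
    ON d.package_name = f.package_name;
EOF
)
echo "$result"
```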
If we break down this query, we see that it starts out as we expect and is then followed by the JOIN clause, which names the second table and the condition relating the two.

Views

The join example above is a pretty useful query for our tables, but due to its complexity it’s best executed from a file rather than as an ad hoc query. SQL addresses this issue by providing a feature called views, which allows a complex query to be stored in the database and used to produce a virtual table that can be used with simple query commands. In the following example, we will create a view using our join query.
Once our view is created, we can treat it as though it were an ordinary table.
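A compact sketch: once defined, a view is queried exactly like a table (names invented):

```shell
# A view stores a query; selecting from it runs that query.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT, size INTEGER);
INSERT INTO t VALUES ('a', 1), ('b', 2);
CREATE VIEW big_items AS
    SELECT name FROM t WHERE size > 1;
SELECT * FROM big_items;
EOF
)
echo "$result"
```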
To delete a view, we use the DROP VIEW statement.
Indexes

It’s been said that the three most important features of a database system are “performance, performance, and performance.” While this is not exactly true (things like data integrity and reliability are important, too), complex operations on large databases can get really slow, so it’s important to make things as fast as we can. One feature we can take advantage of is indexes. An index is a data structure the database maintains to speed up searches. It’s a sorted list of rows in a table, ordered by elements in one or more columns. Without an index, a table is sorted according to values in a hidden column called rowid.
Our database server can locate this row in the table very quickly because it already knows where to find the 100th row. However, if we want to search for the row that contains a particular package name, the server must examine every row until it finds a match.
We can see from the query plan that satisfying this query requires scanning the entire table. To create an index allowing faster searches of package names, we use the CREATE INDEX statement.
After doing this, we’ll look at the query plan and see the difference.
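A sketch of checking the plan with EXPLAIN QUERY PLAN on a toy table; the index name is arbitrary:

```shell
# With an index on the searched column, the query plan reports a
# search using the index instead of a scan of the whole table.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
CREATE INDEX idx_t_name ON t (name);
EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'zsh';
EOF
)
echo "$result"
```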
Hereafter, when we search the table for a package name, SQLite will use the index to go directly to the row rather than examining every row in search of a match. So why don’t we just index everything? The reason is that indexes impose overhead every time a row is inserted, deleted, or updated, since the indexes must be kept up to date. Indexes are best used on tables that are read more often than written to. We probably won’t see much of a performance improvement when searching tables as small as ours.

We can see the index when we examine the table’s schema.
SQLite also has a dot command, .indexes, that lists the indexes in the database.
Another benefit of using an index is that it’s kept in sorted order (that’s how it performs searches quickly). A side effect is that when an index is used during a query, the results of the query will be sorted as well. To demonstrate, we’ll create another index for the table.
To delete our indexes, we use the DROP INDEX statement.
Triggers and Stored Procedures

As we saw earlier during our discussion of views, SQL allows us to store SQL code in the database. Besides views, SQL provides two other ways of storing code: stored procedures and triggers. A stored procedure, as the name implies, allows a block of SQL statements to be stored and treated as a subroutine available to other SQL programs, or for use during ad hoc interactions with the database. Creating a stored procedure is done with syntax along these lines:
Parameters can be passed to stored procedures. Here is an example:
To call this procedure, we would do this:
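SQLite won’t accept any of this (as noted below), so purely as an illustration, here is roughly what the pattern looks like in a MySQL-style dialect; the procedure name, parameter, and table are invented for the sketch:

```sql
-- Hypothetical MySQL-style stored procedure; NOT supported by SQLite.
CREATE PROCEDURE get_customer(IN p_cust CHAR(4))
BEGIN
    SELECT first, last
      FROM Customers
     WHERE cust = p_cust;
END;

-- Invoking it:
CALL get_customer('0001');
```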
Unfortunately, SQLite does not support stored procedures. It does, however, support the second method of code storage: triggers. Triggers are stored blocks of code that are automatically called when some event occurs and a specified condition is met. Triggers are typically used to perform maintenance tasks that keep the database in good working order. Triggers can be set to activate before, after, or instead of the execution of INSERT, UPDATE, or DELETE statements.
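A miniature trigger that archives deleted rows into a second table (all names invented):

```shell
# AFTER DELETE trigger: each deleted row is copied into deleted_rows.
# OLD refers to the row being deleted.
result=$(sqlite3 <<'EOF'
CREATE TABLE t (name TEXT);
CREATE TABLE deleted_rows (name TEXT);
CREATE TRIGGER log_delete AFTER DELETE ON t
BEGIN
    INSERT INTO deleted_rows VALUES (OLD.name);
END;
INSERT INTO t VALUES ('a');
DELETE FROM t;
SELECT * FROM deleted_rows;
EOF
)
echo "$result"
```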
The first thing we do is create a table to hold our deleted rows, using a slightly different form of the CREATE TABLE statement. After creating the table, we create a trigger that fires when rows are deleted. In order to reference the data affected by the trigger, SQL provides the OLD and NEW keywords.

Performing Backups

Since SQLite uses an ordinary file to store each database (as opposed to the exotic methods used by some other systems), we can use regular command line tools such as cp to perform backups. In addition, the .dump dot command writes out the entire database as a stream of SQL statements that will recreate it.
The stream will appear on standard output, or we can use the .output dot command to redirect it to a file.

One interesting application of this technique would be to combine tables from multiple databases into one. For example, let’s imagine we had several Raspberry Pi computers, each performing data logging of an external sensor. We could collect dumps from each machine and combine all of the tables into a single database for data analysis and reporting.

Generating Your Own Datasets

Below are the programs used to create the datasets used in this adventure. They are included in the archive for those who want to create their own datasets.

For Deb-based Systems (Debian, Ubuntu, Mint, Raspberry Pi OS)

The first program named
The second program,
For RPM-based Systems (RHEL, CentOS, Fedora)

The
The
Converting .tsv to SQL

Below are two AWK programs used to convert the .tsv files into streams of SQL statements.
Second, the
Summing Up

SQL is an important and widely used technology. It’s kinda fun, too. While we looked at the general features and characteristics of SQL, there is much more to learn. For example, there are more advanced concepts such as normalization, referential integrity, and relational algebra. Though we didn’t get to the really heady stuff, we did cover enough to get some real work done whenever we need to integrate data storage into our scripts and projects.

Further Reading
Last modified June 8, 2023: files added; some files do not yet have Chinese translations (ddb0575)