Advanced MySQL and Database Optimization

This article mainly helps to deepen the understanding of MySQL and optimize database performance in large data scenarios based on the foundation of learning MySQL basics.

MySQL Architecture

mysql-architecture.png
The design of MySQL reflects the idea of layered architecture, and can be divided from top to bottom as follows:

  1. Connection Layer: Includes various connectors and MySQL shell.
  2. Service Layer: SQL interface, parser, optimizer, cache, etc.
  3. Engine Layer: Includes storage engines such as InnoDB and MyISAM.
  4. Storage Layer: Handles interactions with disks and log files.

Engine Layer

Differences between InnoDB and MyISAM:

  1. InnoDB is designed for handling large datasets and high concurrency, while MyISAM focuses on performance.
  2. InnoDB supports row-level locking, transactions, and foreign keys, whereas MyISAM only supports table-level locking.
  3. InnoDB uses clustered indexes, while MyISAM uses non-clustered indexes as the following content shows.
  4. The MyISAM storage engine supports spatial data indexes (R-Tree), which can be used for geographic data storage.

MySQL Cache

The smallest unit of interaction between MySQL and disk is a page (default 1 page = 16KB).
The default size of the memory pool (Buffer Pool) is 128MB.
Thus, the default number of pages in the memory pool can be calculated as 128M / 16KB = 8172.

redo.log 和 bin.log

The redo log is the log at the InnoDB storage engine layer, while the binlog is the log recorded at the MySQL server layer. These two logs have some overlapping content (different formats). They can be used to recover data after a MySQL crash and restart.

Execution Process of an Update Statement

  1. Modify buffer (generate dirty pages).
  2. Generate corresponding redo.log and bin.log (log-ahead principle).
  3. Persist redo.log and bin.log.
  4. Return modification success.

## Transactions

Transactions refer to a set of operations that satisfy the ACID properties (Atomicity, Consistency, Isolation, Durability). A transaction can be committed using COMMIT, or rolled back using ROLLBACK.

MySQL Transaction Execution Process

  1. Start transaction.
  2. Apply for lock resources, place exclusive lock on row with id=2.
  3. Read the data pages to be modified into innodb_buffer_cache.
  4. Record the data with id=2 into the undo log.
  5. Record the modified data with id=2 into the redo log buffer.
  6. Change the name of id=2 in the buffer cache to “test”.
  7. Commit, trigger two-phase commit (2PC).
  8. End transaction.

Two-Phase Commit of Redo Logs

The writing of redo logs is split into two steps: prepare and commit. This is called “two-phase commit.”
It mainly ensures that the order of redo log transactions matches the order of binlog transactions (ensured by transaction IDs).

  1. Prepare phase: Persist redo to disk (redo group commit), and set the rollback segment to prepared status. Binlog does not perform any operations during this phase.
  2. Commit phase: InnoDB releases locks, releases rollback segments, sets the redo log commit status, persists binlog to disk, and then commits at the storage engine layer.
    In simple terms, after executing an update statement, first set the redo log status to prepared, and then set it to committed after the update content is written to binlog.

Redo log persistence strategy configuration:
=0: Do not persist immediately upon transaction commit; instead, let background processes handle it.
=1: …Persist immediately.
=2: …Immediately hand over to operating system cache.

Concurrent Consistency Issues:

  1. Lost updates.
  2. Dirty reads.
  3. Non-repeatable reads.
  4. Phantom reads.

The following table shows the concurrent consistency issues that difference isolation levels can solve:

Lost updates Dirty reads Non-repeatable reads Phantom reads
Read Uncommited
Read Commited
Repeatable Read
Serialization

MVCC (Multi-Version Concurrency Control)

Under the Read Uncommitted mode: The transaction connection list can change.
Under the Repeatable Read mode: The pointer to the transaction connection list remains fixed.

Indexes

Explanation: A sorted data structure for fast lookup.

There are two ways to add indexes (index) to a data table:

  1. CREATE INDEX
  2. ALTER TABLE

Indexes also exist on disk and need to be loaded into memory for retrieval when required.

Advantages and Disadvantages of Indexes

Advantages: Speed up query and sorting.
Disadvantages: Creating an index duplicates pointers to table data, which slows down insertion and update speeds.

Reasons Why Indexes Are Slow

  1. Poorly written query statements.
  2. Indexes become ineffective.
  3. Too many join queries.

Clustered Indexes and Non-Clustered Indexes

20160717221315263.jpeg
Clustered indexes have data nodes as their leaf nodes, while non-clustered indexes still have index nodes, but they point to corresponding data blocks.
Generally, the primary key will automatically create a clustered index, and a table allows only one clustered index.

Index Matching

Best Left Prefix Principle

# Large Table Optimization

Database Logical Design Three Normal Forms:

1NF: Requires attributes to be atomic.
2NF: Requires records to have unique identifiers, i.e., no partial dependencies.
In 2NF, non-primary attributes must fully depend on primary attributes.
3NF: No field can be derived from other fields, no field redundancy, i.e., no transitive dependencies.

Each normal form necessarily satisfies the previous normal form. There are also BCNF and 4NF.
If there are cases where key fields determine key fields, it does not meet BCNF.

### Optimization Object: Data tables

Based on business types, data is generally divided into three categories:

  1. Flow-type data.
  2. Stateful data: Multiple business data depend on stateful data, and the correctness of this data must be ensured. For example, when making a payment or recharge, the original balance must be obtained to ensure successful payment.
  3. Configuration-type data.

Depending on the type of data, the optimization way also varies. Here is a summary:

Data Growth Situation Table Type Business Characteristics Optimization Core Idea
Millions of records relatively stable data volume Status table OLTP business direction Don’t split if possible; horizontally scale read demand
Millions of records may reach billions or even higher Flow table OLTP business historical data Business splitting, distributed storage design
Millions of records may reach billions or even higher Flow table OLTP business statistical data source Design distributed expansion for statistical data needs
Millions of records shouldn’t have so much data Configuration table General business Small and concise, avoid monolithic

Vertical and Horizontal Partitioning of Databases and Tables

Vertical partitioning splits a table with many columns into multiple tables.
General partitioning principles:

  1. Place less commonly used fields in a separate table.
  2. Split out text, blob, and other large fields into sub-tables.
  3. Place frequently combined query columns in the same table.

Horizontal partitioning refers to the splitting of rows in a data table. When the number of rows exceeds 2 million, performance decreases. At this point, the data from one table can be split into multiple tables.
Some partitioning principles:

  1. Modulo partitioning: For example, a user table with 4 million records can be split into four tables to improve query efficiency.
  2. Part of the business logic can also be split by region, year, and other fields.

Problems Brought by Sharding Databases and Tables

  1. Transaction consistency issues, leading to distributed transactions.
  2. Cross-node join queries.
  3. Cross-node pagination, sorting functions.
  4. Primary key duplication: Consider using global primary keys.
  5. Common tables (parameter tables, data dictionary tables) need to be saved in each database, and all updates to common tables need to be executed multiple times.

References:
1、MySQL官网结构图: https://dev.mysql.com/doc/refman/8.0/en/pluggable-storage-overview.html
2、MySQL千万级大表优化,看这一篇就忘不掉了!: https://database.51cto.com/art/202002/610289.htm
3、表的垂直拆分和水平拆分:https://www.kancloud.cn/thinkphp/mysql-design-optimalize/39326
4、MYSQL索引:对聚簇索引和非聚簇索引的认识:https://blog.csdn.net/alexdamiao/article/details/51934917#:~:text=%E8%81%9A%E7%B0%87%E7%B4%A2%E5%BC%95%E6%98%AF%E5%AF%B9,%E5%AD%98%E5%9C%A8%E4%B8%80%E4%B8%AA%E8%81%9A%E7%B0%87%E7%B4%A2%E5%BC%95%E3%80%82

从C++到Java —— C系列编程语言辨析总结 数据结构与算法笔记 | 动态规划

Comments

You forgot to set the shortname for Disqus. Please set it in _config.yml.
Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×