# Prefixed index in MySQL database

At some point in your career as a developer you have probably run into the following MySQL error message:

SQL Error (1170): BLOB/TEXT column ‘postContent’ used in key specification without a key length

Usually, this error appears when you try to add a column to an index, but the column type is BLOB/TEXT or a long VARCHAR. It can also arise if you already have an index defined on a column and then try to change the column’s type to TEXT/BLOB by altering the table structure.

If you want to change the data type of your indexed column, you can read my post about How to choose right data type for primary key in MySQL database.

MySQL disallows indexing the full value of these columns because the data they contain can be huge. The index would then be just as big, expensive to maintain, and of little benefit. Because of that, MySQL requires that you define the first N characters to be indexed, and the trick is to choose a number *N* that’s long enough to give good selectivity, but short enough to save space. The prefix should be long enough to make the index nearly as useful as it would be if you’d indexed the whole column.
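As a minimal sketch of what this looks like in practice, assume a hypothetical `posts` table with a TEXT column named `postContent`. The table name, the index name, and the prefix length of 100 are illustrative assumptions, not taken from a real schema. The first `CREATE INDEX` fails with error 1170, while the second succeeds because it specifies a prefix length:

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE posts (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    postContent TEXT NOT NULL
);

-- Fails: no key length is given for a TEXT column (error 1170).
CREATE INDEX idx_post_content ON posts (postContent);

-- Works: only the first 100 characters of each value are indexed.
-- The length 100 is an arbitrary placeholder, not a recommendation.
CREATE INDEX idx_post_content ON posts (postContent(100));
```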

Before we go further, let us define an important term. *Index selectivity* is the ratio of the number of distinct indexed values to the total number of rows. Here is an example, using this test table:

id | value |
---|---|
1 | abc |
2 | abd |
3 | acf |

If we index only the first character (*N=1*), the index table will look like this:

index | rows |
---|---|
a | 1,2,3 |

Now it is easy to calculate index selectivity as the ratio between the number of distinct values in the *index* table and the total number of rows in the *test* table: **IndexSelectivity = 1/3 = 0.33**.

Please note that index selectivity is always a value between 0 and 1; the closer it is to 1, the better the index filters rows.

If we increase the number of indexed characters to *N=2*, the index table will look like this:

index | rows |
---|---|
ab | 1,2 |
ac | 3 |

Now index selectivity is equal to 2/3 = 0.67, but the index table is bigger. You get the idea? The trick is to find the minimal number N that gives the same index selectivity as if we had indexed the complete column value and not only its first N characters.
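The same calculation can be reproduced in MySQL. The sketch below assumes a hypothetical table named `prefix_demo` that mirrors the three-row example above; the table and alias names are made up for illustration:

```sql
-- Hypothetical demo table mirroring the three-row example above.
CREATE TABLE prefix_demo (
    id      INT PRIMARY KEY,
    `value` VARCHAR(10) NOT NULL
);

INSERT INTO prefix_demo (id, `value`) VALUES
    (1, 'abc'),
    (2, 'abd'),
    (3, 'acf');

-- Selectivity = distinct prefixes / total rows, for N = 1, 2, 3.
SELECT
    COUNT(DISTINCT LEFT(`value`, 1)) / COUNT(*) AS prefix1,     -- 1 distinct prefix   / 3 rows
    COUNT(DISTINCT LEFT(`value`, 2)) / COUNT(*) AS prefix2,     -- 2 distinct prefixes / 3 rows
    COUNT(DISTINCT LEFT(`value`, 3)) / COUNT(*) AS prefix3,     -- 3 distinct prefixes / 3 rows
    COUNT(DISTINCT `value`) / COUNT(*)          AS full_column  -- 3 distinct values   / 3 rows
FROM prefix_demo;
```

With these three rows, *N=3* already reaches the selectivity of the full column, so a longer prefix would only make the index larger.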

There are two possible approaches to calculating this for your table and choosing the number N. I will use a test database to demonstrate the calculations. Here is the database dump I used.

Let’s say we want to add column *last_name* in table *employees* to the index, and we want to find the smallest number *N* that will produce the best index selectivity.

First let us identify the most frequent last names:

    select count(*) as cnt, last_name from employees group by employees.last_name order by cnt desc limit 0,15;

    +-----+-------------+
    | cnt | last_name   |
    +-----+-------------+
    | 226 | Baba        |
    | 223 | Coorg       |
    | 223 | Gelosh      |
    | 222 | Farris      |
    | 222 | Sudbeck     |
    | 221 | Adachi      |
    | 220 | Osgood      |
    | 218 | Neiman      |
    | 218 | Mandell     |
    | 218 | Masada      |
    | 217 | Boudaillier |
    | 217 | Wendorf     |
    | 216 | Pettis      |
    | 216 | Solares     |
    | 216 | Mahnke      |
    +-----+-------------+
    15 rows in set (0.64 sec)

As you can see, the last name *Baba* is the most frequent one. Now we are going to find the most frequently occurring *last_name* prefixes, beginning with five-letter prefixes.
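The five-letter prefixes can be counted with a query of the same shape as the *N=9* query shown later in this post. The following is a sketch along those lines, not necessarily the exact statement that was run:

```sql
-- Count how often each 5-character prefix of last_name occurs
-- and list the 15 most frequent prefixes.
SELECT COUNT(*) AS cnt, LEFT(last_name, 5) AS prefix
FROM employees
GROUP BY prefix
ORDER BY cnt DESC
LIMIT 0, 15;
```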

    +-----+--------+
    | cnt | prefix |
    +-----+--------+
    | 794 | Schaa  |
    | 758 | Mande  |
    | 711 | Schwa  |
    | 562 | Angel  |
    | 561 | Gecse  |
    | 555 | Delgr  |
    | 550 | Berna  |
    | 547 | Peter  |
    | 543 | Cappe  |
    | 539 | Stran  |
    | 534 | Canna  |
    | 485 | Georg  |
    | 417 | Neima  |
    | 398 | Petti  |
    | 398 | Duclo  |
    +-----+--------+
    15 rows in set (0.55 sec)

Every prefix occurs far more often than any full last name, which means we have to increase N until the counts are almost the same as in the first result.

Here are the results for N=9:

    mysql> select count(*) as cnt, left(last_name,9) as prefix from employees group by prefix order by cnt desc limit 0,15;

    +-----+-----------+
    | cnt | prefix    |
    +-----+-----------+
    | 336 | Schwartzb |
    | 226 | Baba      |
    | 223 | Coorg     |
    | 223 | Gelosh    |
    | 222 | Sudbeck   |
    | 222 | Farris    |
    | 221 | Adachi    |
    | 220 | Osgood    |
    | 218 | Mandell   |
    | 218 | Neiman    |
    | 218 | Masada    |
    | 217 | Wendorf   |
    | 217 | Boudailli |
    | 216 | Cummings  |
    | 216 | Pettis    |
    +-----+-----------+
    15 rows in set (0.61 sec)

This means the index will be a bit less effective when we search for last names that begin with ‘Schwartzb’, but in every other case index selectivity is really good. Here are the results for N=10:

    +-----+------------+
    | cnt | prefix     |
    +-----+------------+
    | 226 | Baba       |
    | 223 | Coorg      |
    | 223 | Gelosh     |
    | 222 | Sudbeck    |
    | 222 | Farris     |
    | 221 | Adachi     |
    | 220 | Osgood     |
    | 218 | Mandell    |
    | 218 | Neiman     |
    | 218 | Masada     |
    | 217 | Wendorf    |
    | 217 | Boudaillie |
    | 216 | Cummings   |
    | 216 | Pettis     |
    | 216 | Solares    |
    +-----+------------+
    15 rows in set (0.56 sec)

These are very good results. They mean that we can build the index on column *last_name* using only the first 10 characters. In the table definition, *last_name* is declared as VARCHAR(16), so we save up to 6 bytes per entry (or more if the last name contains multi-byte UTF-8 characters). This table has 1637 distinct values; multiplied by 6 bytes, that is a little under 10 KB, and imagine how this number would grow if our table contained millions of rows.
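Creating the prefix index itself might then look like the sketch below. The index name `idx_last_name_prefix` is a hypothetical choice, and the `EXPLAIN` statement is only one way to check whether the optimizer picks up the new index:

```sql
-- Index only the first 10 characters of last_name.
-- idx_last_name_prefix is an illustrative name, not part of the test database.
ALTER TABLE employees
    ADD INDEX idx_last_name_prefix (last_name(10));

-- Inspect the plan for a lookup by last name; the "key" column of the
-- output shows which index, if any, MySQL chose.
EXPLAIN SELECT *
FROM employees
WHERE last_name = 'Baba';

-- The Sub_part column reports the indexed prefix length
-- (NULL would mean the whole column is indexed).
SHOW INDEX FROM employees;
```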

The other way to calculate index selectivity is to compute it directly from the definition: *IndexSelectivity = DistinctRows/TotalRows*. Here is how to calculate it:

    select count(DISTINCT last_name)/count(last_name) from employees;

    +--------------------------------------------+
    | count(DISTINCT last_name)/count(last_name) |
    +--------------------------------------------+
    |                                     0.0055 |
    +--------------------------------------------+
    1 row in set (0.47 sec)

Let’s calculate index selectivity when the index is based on the first N characters, with N ranging from 5 to 10. We are looking for the result closest to the full-column selectivity calculated above.

    select count(DISTINCT last_name)/count(last_name) as originalSelectivity,
           count(DISTINCT left(last_name,5))/count(last_name) as prefix5,
           count(DISTINCT left(last_name,6))/count(last_name) as prefix6,
           count(DISTINCT left(last_name,7))/count(last_name) as prefix7,
           count(DISTINCT left(last_name,8))/count(last_name) as prefix8,
           count(DISTINCT left(last_name,9))/count(last_name) as prefix9,
           count(DISTINCT left(last_name,10))/count(last_name) as prefix10
    from employees;

    +---------------------+---------+---------+---------+---------+---------+----------+
    | originalSelectivity | prefix5 | prefix6 | prefix7 | prefix8 | prefix9 | prefix10 |
    +---------------------+---------+---------+---------+---------+---------+----------+
    |              0.0055 |  0.0051 |  0.0054 |  0.0054 |  0.0054 |  0.0055 |   0.0055 |
    +---------------------+---------+---------+---------+---------+---------+----------+
    1 row in set (2.42 sec)

From the results we can see how index selectivity changes as we increase the number of indexed characters N. For N=9 and N=10 the selectivity (to four decimal places) is the same as if we had indexed all characters of the *last_name* column.

The benefits of this approach are obvious, and I suggest you test it on a real-world database.

I’d love to get your comments below or you can email me at **code.epicenter at gmail.com**.
