Monday, February 2, 2015

On writing a book

After reviewing two books about Python, people from Packt asked me if I was willing to write a Python book. I'm glad to see that my first book, Mastering Python Design Patterns is published!

As I expected, writing a book is much tougher than reviewing one. Especially if you have a full-time job, like in my case. I had to deliver a chapter about every week. This is very challenging, since it means that I had to spend many evenings and weekends focusing on delivering a chapter on time.

I hope that my book will be appreciated by the Python (3.x) community. I tried to focus on doing things the Python way instead of reproducing Java-ish or C++-style solutions. To be honest I preferred a different title: I recommended the title "Idiomatic Python Design Patterns" but my proposal was rejected, mainly for marketing reasons.

If you are also considering writing a book, I think that it is a very good idea, but take into account the following:

  • Do you have the time to do it? Unless your book is self-published, you'll need to sign a contract with a publisher and that means that there will be deadlines. Make sure that you discuss it first with your partner/family, since it is a demanding task.
  • Does it fill a missing gap? I don't recommend you to write a book just for the money (yes, you are paid for writing the book and depending on the contract you can also get a share from the sales). I have seen many examples of poorly-written books that were created only because the author wanted to make some money. Don't do it. It might be good for you pocket, but it can harm your reputation, your career, and your psychology (think of bad reviews).
To expand a little bit more on point two: I feel that my book is indeed filling a gap. Although there are other books about Design Patterns in Python, none of them focuses on Python 3. In fact, I reviewed one of them, and apart from targeting only Python 2.x, IMHO it is not using idiomatic Python solutions in many cases.

My book is not perfect in any way. The lack of time meant that some examples had to be smaller and more trivial than expected. But this is part of the game. If you are working full-time and you are writing a book, time is your enemy! Be prepared to make compromises...

Wednesday, January 14, 2015

Course review: SQL Performance

Update: Markus was kind enough to comment on my review. Regarding the "minimise the number of tables to limit joins" he said:
it is often the best approach to store some attribute redundant (e.g. normalised as before plus wherever needed). Maintenance of this redundancy should be delegated to the database whenever possible (e.g. using triggers or materialised view). You should not do that before having those performance problems (avoid "premature optimisation").Reducing the number of joins is a good way to get performance. But only once you are in that situation. And of course, there are other, simpler ways to improve performance that should be leveraged first (e.g, good-old indexing).
So I'm glad that we agree that normalisation is a good thing and that we should only try to find alternative solutions if nothing else (e.g. proper indexing) works.

Markus also made an important comment about the column order in the WHERE clause that is not clarified in my original post:
  • The column order in indexes matters a lot 
  • The column order in the WHERE clause doesn't matter (rare exceptions exist, but generally, it doesn't!).


The original post starts here...

Last October November I followed a course related to the performance of SQL. The course was given by Markus Winand. Although we don't agree on everything (for example I don't like the "create as few tables as possible to minimise joins and achieve better performance" principle because it is against normalisation) Markus has a great knowledge of general and RDBMS-specific performance related issues.

I'm glad that I followed this course. Markus gave us a copy of his book which is very compact and to the point. This is an example of a book that I would never consider reading but it turns out to be a hidden gem. I recommend it to everyone working with relational databases.

It took me some time to write this post because I wanted to read the book first. In this book you will find things that you don't know for sure. For example, did you know that:
  • When building indexes on more than one columns (concatenated indexes), the order of the columns matters a lot?
  • The order of the statements in the WHERE part of a query affects whether a concatenated index is used or not?
  • LIKE expressions with leading wildcards (eg. '%SQL') cannot make use of an index?
  • ORMs can cause big performance problems because of the bad queries that they generate?
  • Selecting only the necessary columns (avoid SELECT *) can improve the performance of joins?
  • An index that covers all the columns of an SQL query (including the columns of the SELECT part) does not need to access any other data structures except the index and improves the performance of a query enormously?
  • ORDER BY and GROUP BY can also be indexed?

The main message of the book is that indexes should be built by us, the developers, not by DBAs or anyone else. That's because only we know how the data are queried, and therefore only we can build the proper indexes.

Personally, I'm very sad to see how many features that are supported by other RDBMS are not supported by MySQL. To mention a few: indexing functions and expressions, partial indexes, indexing using ASC and DESC, window functions. Fortunately, MariaDB is getting there and I hope that we'll switch to it (at work) at some point.

Sunday, November 16, 2014

Course review: Language Engineering with MPS

Last week I followed a two-day course called "Language Engineering with MPS". The course was given by Markus Voelter.

MPS is a free software (using the Apache 2.0 license) framework built on top of Intellij IDEA. Both MPS and Intellij IDEA are actively developed by JetBrains. MPS can be used for implementing Domain-Specific Languages (DSLs), usually by extending a base language which by default is Java. Extending Java is not a requirement. In fact, Markus is involved in the development of mbeddr, which uses a clean version of the C language as the base for targeting embedded system development.

According to Markus textual-based language development tools such as Yacc, lex, Bison, ANTLR, and so forth are fading out because they lack support of an intelligent IDE. Although I'm not fully convinced about this statement I agree that IDE support when developing DSLs is a big plus. Do not overlook IDE support. It gives you (for free) autocompletion, a nice user interface, very readable error messages, instant deployment and debugging, and much more.

During the course we covered only external (context-free) DSLs, because Markus considers internal (context-sensitive) DSLs hacky, since they usually rely on the metaprogramming features of a specific language (Ruby, Lisp, etc.). This is most times either very limited or too complex (for example you end up with unreadable error messages).

Markus has a good knowledge in language design. He gave us some good tips regarding DSL development, such as forbidding Turing-completeness in the DSL to make the static analysis of a code block possible. Another tip was to support many keywords in the DSL (instead of having as few keywords as possible, which is considered good in general purpose languages like C) for giving the chance to the DSL user to provide hints about the performance and behavior of a code block. For example provide two keywords for for loops: the default for is (or actually tries to be) concurrent, while the alternative forseq is always sequential.

Our main course activity was to use MPS for developing an Entities DSL. An Entity is an abstraction that can have a variable number of attributes with validated types. We created our own typing system for that (using Java's typing system as a basis), which supports strings and numbers. An Entity can also have references to other Entities. Finally, we can define functions inside an Entity using the fun keyword. Here's an example of an Entity:



Notice how we can create custom error messages for informing the DSL users when they are trying to do erroneous things such as define a variable with the same name twice. Another error reported (underlined in red on the picture) is when the user tries to return an incorrect type from a function, in this case a string from a function that should return an integer (notice the :number part).

From what I've seen in the course I feel that MPS is an interesting tool with the following pros and cons.

Pros:
  • Autocompletion.
  • Readable error messages. Even if a message is not very readable you can jump to the source code immediately using a single click.
  • Nice user interface.
  • In general it offers all the goodies of an IDE. Integrated debugging, many ways of searching, refactoring, and so forth.
Cons:
  • The DSL user (domain expert) needs to install MPS for using our DSL. This usually requires some effort, because we need to create a customized (clean) version of MPS with all development features hidden/disabled to avoid confusing the user.
  • Like all tools, MPS requires time and effort to feel confident with it. Especially typing in the MPS editor can be confusing and frustrating because it is very different from free-text typing which is the usual way of writing code.
  • Documentation. There is only one book targeting explicitly MPS so far.
  • Lag on Windows. The hired laptops that we used during the course were quite powerful but MPS was still lagging on Windows. I have tested it on GNU/Linux and I don't have any issues (and neither did Markus on his MacBook). It seems that MPS has performance issues on Windows.

Saturday, November 8, 2014

Two less common tricks for improving unexplained slow MySQL queries

Recently I faced an SQL performance issue. What I wanted to do was rather common: Apply the (set) difference operation on two tables.

In relational algebra, the difference operation applied on two tables A and B gives as a result a new table C that contains all the elements that are in table A but aren't in table B.

That's a very common operation. A common example is having the table Students and the table Grades. To find all Students that have not been graded yet you can use the difference operation. Or in terms of set theory and using \ as the notation of difference:

{1,2,3} ∖ {2,3,4} = {1}

Some RDBMS have the difference operation built-in using the EXCEPT keyword. So the query in this case would be something like:

But that's not the case for MySQL. MySQL does not support EXCEPT but we can get the same result using a LEFT (OUTER) JOIN:
OK, so I used a LEFT JOIN to find out that the query was painfully slow. One table had 700 thousand records and the other 130 thousand records. For a relational database that's not a big deal (only a few seconds, let's say maximum 3).

If you search on the Web for slow LEFT JOIN query you'll see that everyone recommends (a) adding indexes and (b) using the SQL optimizer. Well, I had already done both things without achieving my goal. I added the indexes using CREATE INDEX and then used the optimizer by adding EXPLAIN in front of my query and made sure that the indexes were used properly.

So, what's left? Actually there are two other important things to check. The first is to inspect the output of SHOW PROCESSLIST. This will show you the list of active processes on the server. When writing queries killing the SQL client because it crashed/became unresponsive due to a bad query is not unusual. But killing the client does not necessarily mean that the query is killed. There might still be orphan queries that eat the resources of your server but you have no control of them. You can kill them using KILL PROCESS_ID (replace PROCESS_ID with the actual ID of the orphan process).

The second thing that really impacts the performance of MySQL is when trying to join tables that use different collations. MySQL uses by default the legacy latin1_swedish_ci collation, so if one of the tables you are trying to join is using a different collation (for example utf8_unicode_ci, which makes much more sense as a default nowadays) the joins become terribly slow. Just make sure that all database tables use the same collation.

Happy querying!

Monday, September 29, 2014

Book review: SQL Antipatterns

In SQL Antipatterns, Bill Karwin does a great job in explaining how to make efficient use of the relational model, instead of abusing it like it's usually done. Many developers abuse relational databases using antipatterns such as ID required, entity-attribute-value, index overkill, and so forth (you should read this book without second thought). Some developers go as far as trying to implement a search engine based on the LIKE keyword.

All those are examples of not using the relational model properly. If we want to bypass referential integrity or save everything in one table, then we should not use a relational DB in the first place. Cursing on the performance of a relational DB when it's not properly used is very wrong.

Sunday, May 4, 2014

Being a technical reviewer (again)

A few months ago I experienced (for the first time) how it feels to be part of a technical reviewing team. I reviewed a packt introductory book about Design Patterns.

Today I'm glad to see that another packt book, of which I was for once again one of the technical reviewers, has been published. The book is called Mastering Object-oriented Python. It's a book that focuses on writing OOP code in Python 3.

I would recommend this book to all Python programmers, both beginners and advanced. It covers all aspects of the language (to mention a few: special methods, unit testing, decorators, serialization, etc.) and shows different possible designs, explaining the pros and cons of each design. What I really like is that the code in the book is written in a Pythonic style, and the author makes a good job at explaining how Python differs from Java/C++.

A few warnings: This is a big book (~ 600 pages). You can read the whole thing, but I believe that it will be much more useful as a handbook. Also note that the book assumes familiarity with Python 3 and Design Patterns.

You might wonder why I accepted to do again a technical review. Some people find technical reviewing a waste of time. But I disagree. To become a good programmer, you need to read a lot of code instead of just writing. In fact, programmers tend to read much more existing code than write new code. If reading code is important, reading good code is much more important. And I believe that the code in this book is well written.

Saturday, April 19, 2014

BASH: syntax error near unexpected token `('

After making some portability and readability improvements to shell-utils, I used BASH, sh, and dash to test it. While sh and dash were fine, BASH returned the error:

line 358: syntax error near unexpected token `('
line 358: `ls ()'

That is strange. BASH usually introduces shell portability issues because of the extra features it provides, so I would expect to have a problem with the other shells.

It turns out that BASH did a pretty good job with reporting the source of the error. Note that shell-utils redefines a few everyday commands as functions, to make them more verbose and secure safe (eg. ls becomes ls --color=auto, rm becomes rm -i, etc.). But usually those everyday commands are already defined as aliases in .bashrc. Aliases are evaluated before functions, and defining a function that has the same name as an alias is not allowed. And that's what BASH is trying to tell me in the error message. Commenting all aliases in .bashrc fixed the issue :)