Tuesday, November 24, 2015

Book review: Seven Databases in Seven Weeks

That's a really nice book. Of the seven databases covered, I was familiar with PostgreSQL and, only briefly, with Neo4j. So the book gave me the chance to explore some more databases and find out about their strengths and weaknesses. In the following paragraphs I'll explain what I found nice and what not so nice about each of them. Before I start: if you are planning to buy this book, be warned that some features are deprecated or even removed, because some of the database systems have evolved since the book was written (2012). For example, the largest part of the Neo4j chapter is useless, because it doesn't use the Cypher language.

PostgreSQL rocks. It's a very powerful RDBMS and I can attest to that, since I have used it professionally. Postgres is mature, fast, and rock-solid. For those reasons I would choose it for all problems that play nicely with relational DBs. And yes, an RDBMS is not the answer to all problems. For example, distributed computations do not fit well into this model. Scaling is limited to making your single DB server/cluster more powerful by upgrading or extending its hardware. And not all problems require full ACID compliance and strict schema enforcement.

Riak is flexible. Being able to interact with a DB using a REST interface and a tool like curl should not be underestimated. What I like about Riak is that you can store whatever resource you like (be it a document, an image, etc.) on the fly and map it to your URL of preference. It just works! I see Riak as a Web filesystem that supports distributed computations through mapreduce. But Riak also supports connecting resources and traversing those connections (link walking). On the down side, configuring Riak and understanding some of its concepts (for example conflict resolution and adding indexes) is currently a pain. And prebuilt binaries can only be found on basho.com (Windows is not supported at all).

HBase is unusual. It takes some time to understand the way a column-oriented DB works. What I found great is that versioning is built in. If you care about data history, that's a big deal. Another plus: compression and fast lookups using Bloom filters are also built in. Great features that can save a lot of development time. The negatives: no REST interface, complex configuration, and no prebuilt binaries -- you need to compile HBase on your own, so forget Windows unless you like pain.

MongoDB is all about JavaScript. Having the full support of a powerful language like JavaScript while using a DB is very valuable. Being able to save JSON documents adds a lot of flexibility, since they can nest arbitrarily. But this flexibility comes at a cost: updating a document means replacing it without a warning, deleting specific elements of a document is not supported, and debugging JavaScript code is a pain. On the plus side: the mapreduce support of Mongo is nice, and it also supports indexing documents. Configuring replicas and sharding is also quite easy. And operating system support is very good.
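The replace-on-update pitfall is easy to demonstrate without a running MongoDB. Below is a small Python sketch using a plain dict as a stand-in document store (the helper names are mine, not pymongo's): a whole-document update silently drops every field you didn't mention, while a $set-style partial update keeps them.

```python
# A plain dict as a stand-in document store -- NOT pymongo, just an illustration.
store = {"user:1": {"name": "Alice", "age": 30, "tags": ["admin"]}}

def replace_doc(key, new_doc):
    """Book-era MongoDB update() semantics: the whole document is replaced."""
    store[key] = new_doc

def set_fields(key, fields):
    """$set-style partial update: only the named fields change."""
    store[key].update(fields)

replace_doc("user:1", {"age": 31})
print(store["user:1"])  # -> {'age': 31}   name and tags are silently gone!

store["user:1"] = {"name": "Alice", "age": 30, "tags": ["admin"]}
set_fields("user:1", {"age": 31})
print(store["user:1"])  # -> {'name': 'Alice', 'age': 31, 'tags': ['admin']}
```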

CouchDB is cute. The Futon Web interface makes CouchDB very user-friendly. Its REST interface and the ability to use curl make it developer-friendly. Moreover, CouchDB has an interesting approach to replication, since all servers are treated equally (no master-slave model). The same is true for conflict resolution: one of the conflicting updates is automatically considered the winner, and this is consistent across all nodes. But that's not necessarily the "correct" update... One last thing: CouchDB is easy to install on all popular platforms.

Neo4j is the graph database. There are simply no competitors when it comes to modelling relationships (think of social networks, movies, food, drinks) using graphs. Neo4j has its own query language (Cypher) and a very nice browser that makes experimenting easy. The documentation is also extensive and interactive. Building a cluster is easy. The negatives: a learning curve (new concepts and a new language), and the enterprise edition is not free (gratis).

Redis is generic. It's not a DB as such, but more of an in-memory data structure toolkit. Redis is simple to use, fast, and supports transactions. Its commands have strange names, though, probably the result of an effort to avoid verbosity. Because it is so generic, Redis can be used as a fast in-memory cache for applications that require high performance.
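To illustrate the caching use case without a running Redis server, here is a tiny Python sketch that mimics the semantics of Redis's SET with a TTL and GET. The class and its lazy-expiration strategy are my own simplification for the demo, not the Redis implementation.

```python
import time

class TinyCache:
    """A minimal in-memory cache mimicking Redis SET/GET with a TTL.
    Illustration only -- not a Redis client."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazy expiration, as Redis also does
            return None
        return value

cache = TinyCache()
cache.set("user:42", "Alice", ttl=60)  # cache for one minute
print(cache.get("user:42"))            # -> Alice
print(cache.get("user:43"))            # -> None (cache miss)
```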

Final comments: Some people have proposed a better reading of the name NoSQL: Not only SQL. I like this definition. Similar to programming paradigms and languages, different database systems have both strengths and weaknesses. Why not use more than one to achieve our goals? That's the main idea behind the polyglot persistence concept, as suggested by the authors. Polyglot persistence means using more than one database to target different application layers. For example Redis for caching, Neo4j for modelling relationships, and PostgreSQL for persistence.

Friday, November 20, 2015

Joy Of Coding 2015 Review

Like last year, Joy of Coding 2015 was a great conference. This year the conference took place in May, once again in Rotterdam. The organisation was similar to that of last year: a few common talks, but also parallel talks and workshops.

The conference this year started with a keynote by Chris Granger (@ibdknox): "Programming as distributed cognition: Defining a super power". I missed the beginning of the keynote but AFAIU Chris wanted to stress the importance of using programming as an exploration tool. In that sense, we should create programming tools that make it easier for scientists to model problems and experiment quickly. His tools Light Table and Eve focus on those aspects.

Next, I watched the presentation "Joy of testing" by John Hughes. The quote of this presentation was "Do not write tests, generate them!". Indeed, using the Erlang version of QuickCheck, John showed a live demo of discovering and fixing bugs using generated tests. John also explained his personal experiences of using the same tools to discover and fix bugs that existed in concurrent Erlang production code (AFAIR the code was used in the automotive industry).

The next speaker was Laurent Bossavit (@Morendil). This keynote was more about psychology than technology. But it seems that there's a deep connection between the two. Laurent has suffered from depression and, according to him, depression is a feature and not a bug. It is very important to be able to debug ourselves, and not just our programs. We should stay away from things that make us sad and focus on the things that make us happy. As an example, you might be able to find a COBOL job that pays well, but does COBOL really make you happy? Maybe a job with a lower salary but more fun (think of Python, Arduino, etc.) is better for you.

The next keynote was about "Mutation testing" by Roy van Rijn. Roy believes that mutation testing, a technique for measuring the quality of unit tests, is better than code coverage. There's an actual Java tool that can be used to explore this area: Judy. A mutant is a version of a program with a modified operator, for example a logical AND replaced with a logical OR. Killing a mutant means that the incorrect behaviour of the modified code is properly detected and reported, and that's basically what Judy does. I've never tried mutation testing. Maybe one day I will...
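The idea fits in a few lines of Python: mutate one operator in the source with the ast module, rerun the test suite, and check whether it now fails (killing the mutant). This is only a toy illustration of the technique, not how Judy works internally; the function and its "suite" are invented for the demo.

```python
import ast

# A toy function under test, kept as source text so we can mutate its AST.
src = """
def price_ok(price, in_stock):
    return price < 100 and in_stock
"""

class SwapAndOr(ast.NodeTransformer):
    """One classic mutation operator: turn every 'and' into 'or'."""
    def visit_BoolOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.And):
            node.op = ast.Or()
        return node

def run_tests(fn):
    """A tiny test suite. A good suite fails for the mutant, killing it."""
    return fn(50, True) is True and fn(150, True) is False

# The original code passes the suite.
original = {}
exec(compile(ast.parse(src), "<orig>", "exec"), original)
assert run_tests(original["price_ok"])

# The mutant must make the suite fail -- otherwise the suite is too weak.
mutant_tree = ast.fix_missing_locations(SwapAndOr().visit(ast.parse(src)))
mutant = {}
exec(compile(mutant_tree, "<mutant>", "exec"), mutant)
print("mutant killed:", not run_tests(mutant["price_ok"]))  # -> mutant killed: True
```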

I enjoyed the next talk by Crista Lopes (@cristalopes) a lot. Crista is the author of a really nice programming book that anyone who is involved with programming should read: Exercises in Programming Style. The book uses a simple concept: implement the same program using the same language (Python) in 33 different styles! A style is basically a form of a programming paradigm (think of object-oriented, functional, procedural, etc.). During the talk Crista demonstrated a subset of the 33 styles of her book. The purpose of Crista's talk (and AFAIU that's also the focus point of the book) was not to compare the different styles and take sides, but to stress the importance of recognising and understanding the different styles. I couldn't agree more. There's no best programming style for all purposes, and we should be able to work with all of them. BTW there's a GitHub repository with the styles.

The workshop that I picked for Joy of Coding 2015 was about "Property based testing", by Marc Evers, Rob Westgeest, and Willem van den Ende. Property based testing is about the automatic generation of unit tests for a system by describing its properties. The benefit of using property based testing instead of unit testing is that it (a) takes less time, since the tests are generated, and (b) is more reliable than writing tests manually, since humans tend to forget to cover all possible cases.
During the workshop we used JavaScript (Node.js and JSVerify) and went through several examples.
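The workshop used JSVerify, but the core idea fits in a few lines of any language. Here is a hedged Python sketch (the generator and helper names are my own, not from the workshop): generate many random inputs and check that a property holds for every one of them.

```python
import random
from collections import Counter

def generate_list(rng, max_len=20):
    """Input generator: a random list of small integers."""
    return [rng.randint(0, 50) for _ in range(rng.randint(0, max_len))]

def check_property(prop, make_input, runs=200, seed=0):
    """Run a property against many generated inputs, QuickCheck/JSVerify style.
    Returns None on success, or a counterexample input on failure."""
    rng = random.Random(seed)
    for _ in range(runs):
        data = make_input(rng)
        if not prop(data):
            return data
    return None

# Two properties of sorting: the output is ordered, and it is a
# permutation of the input.
def sorted_is_ordered(xs):
    ys = sorted(xs)
    return all(a <= b for a, b in zip(ys, ys[1:]))

def sorted_keeps_elements(xs):
    return Counter(sorted(xs)) == Counter(xs)

print(check_property(sorted_is_ordered, generate_list))      # -> None
print(check_property(sorted_keeps_elements, generate_list))  # -> None

# A buggy "sort" that drops duplicates is caught almost immediately:
def buggy_sort_keeps_elements(xs):
    return Counter(sorted(set(xs))) == Counter(xs)

print(check_property(buggy_sort_keeps_elements, generate_list))  # prints a counterexample list
```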

The closing keynote couldn't have been better: a mix of jokes and programming advice by Kevlin Henney (@KevlinHenney), built around nice (and not so nice) pieces of code written by various programmers in different languages. Studying code written by others is important, and something we all need to do.

Yet another good year for Joy of Coding. I hope that it will continue to use the same successful recipe in the years to come... :)

Monday, November 9, 2015

Book review: Pragmatic Guide to JavaScript

This was my first JavaScript book and I consider it a good overview covering the pros and cons of the language. The author gives good advice regarding which features of pure JavaScript are fine to use and for which features a framework should be preferred to avoid browser incompatibilities.

Many popular applications are demonstrated (custom tooltips, infinite scrolling, form validation, autocompletion, lightbox, 3rd party APIs) and concepts such as client vs server programming are clearly explained. Christophe's focus on Prototype is not a problem for me. It's his favorite framework and the one that he knows well, so it makes sense that he's using it for the demos.

One practical problem: this book is not maintained any more, and as a consequence a few examples are broken, due to an expired domain and changes to the Twitter API. I contacted the author on GitHub and he confirmed it. Still, for a book published five years ago, it's a nice compact guide for people who are familiar with programming and want to focus on the specifics of JavaScript.

Sunday, March 15, 2015

Playing with microcontrollers

The last training course that I followed was about programming microcontrollers. The course was given by Leon van Snippenberg, who has very good expertise in microcontrollers.

For the practical part of the course we used the Microchip dsPIC33F, a 16-bit, 40 MHz microcontroller (a system-on-a-chip solution). I admit that I'm not very fond of this proprietary platform, so I enjoyed the theoretical part of the course much more than the practical one. I would have been more excited if we had used an open hardware solution like Arduino, Raspberry Pi, or something comparable.

A few highlights from the course:
  • A three-operand assembly instruction does not necessarily mean that three registers are used. For example ADD W0, W1, W0 uses only two distinct registers.
  • Most microcontrollers use the Harvard instead of the Von Neumann architecture. This means that there are two distinct address buses, as well as two data buses (instead of one address and one data bus).
  • When writing code in assembly we should avoid thinking about code optimisation, since the code is usually very fast to execute (but very slow to produce).
  • A common problem when programming microcontrollers is read-modify-write. One way to solve it is using shadow registers.
  • When programming a microcontroller using a C interface and interrupts, it is very important to use the volatile keyword to disable optimisations that might remove code that seems to be dead but is actually used. Because of that, it is also very important to test the code at every compiler optimisation level, to ensure that it doesn't break.
  • The hardware timers of a platform do not need to follow the same architecture with the processor. For example a platform might use a 16-bit processor with 32-bit timers.
  • Buffers and interrupts are used to solve communication problems between different devices (e.g. a computer communicating with a microcontroller using the serial port).
  • When dealing with non-deterministic problems, disabling interrupts is the most favoured solution.
  • Using a real-time operating system (RTOS) simplifies programming, because we avoid the need to write complex state machines and custom schedulers (those problems are already solved in the RTOS).
  • Multicore support in RTOS is a challenge (unsolved problem?).

We (my colleague and I) challenged Leon by asking why one would prefer a much more expensive solution like Microchip's dsPIC family over a Raspberry Pi or an Arduino. The price of the latest Pi is unbeatable. The response was that we should use whatever fits our purpose, and that the Pi manages to achieve such a low price because its makers can estimate in advance the minimum number of units that will be sold. Those manufacturing deals are critical in forming the end price of a prototyping platform.

So far I only own an mbed LPC1768 and I'm very satisfied with it. I hope that I'll build some more advanced prototypes in the future, but you have to start from something. I began with flashing LEDs, continued with adding some basic components like a button, and at some point I built my first practical prototype: a darkness-activated LED.

Isn't that nice? In my future posts the plan is to spend more time on explaining the code of prototypes like the last one. For now you can check my mbed repository page.

Monday, February 2, 2015

On writing a book

After reviewing two books about Python, people from Packt asked me if I was willing to write a Python book. I'm glad to see that my first book, Mastering Python Design Patterns is published!

As I expected, writing a book is much tougher than reviewing one. Especially if you have a full-time job, like in my case. I had to deliver a chapter about every week. This is very challenging, since it means that I had to spend many evenings and weekends focusing on delivering a chapter on time.

I hope that my book will be appreciated by the Python (3.x) community. I tried to focus on doing things the Python way instead of reproducing Java-ish or C++-style solutions. To be honest I preferred a different title: I recommended the title "Idiomatic Python Design Patterns" but my proposal was rejected, mainly for marketing reasons.

If you are also considering writing a book, I think that it is a very good idea, but take into account the following:

  • Do you have the time to do it? Unless your book is self-published, you'll need to sign a contract with a publisher and that means that there will be deadlines. Make sure that you discuss it first with your partner/family, since it is a demanding task.
  • Does it fill a gap? I don't recommend writing a book just for the money (yes, you are paid for writing the book and, depending on the contract, you can also get a share of the sales). I have seen many examples of poorly-written books that were created only because the author wanted to make some money. Don't do it. It might be good for your pocket, but it can harm your reputation, your career, and your psychology (think of bad reviews).
To expand a little bit more on point two: I feel that my book is indeed filling a gap. Although there are other books about Design Patterns in Python, none of them focuses on Python 3. In fact, I reviewed one of them, and apart from targeting only Python 2.x, IMHO it is not using idiomatic Python solutions in many cases.

My book is by no means perfect. The lack of time meant that some examples had to be smaller and more trivial than I would have liked. But this is part of the game. If you are working full-time and writing a book, time is your enemy! Be prepared to make compromises...

Wednesday, January 14, 2015

Course review: SQL Performance

Update: Markus was kind enough to comment on my review. Regarding the "minimise the number of tables to limit joins" he said:
it is often the best approach to store some attributes redundantly (e.g. normalised as before, plus wherever needed). Maintenance of this redundancy should be delegated to the database whenever possible (e.g. using triggers or materialised views). You should not do that before having those performance problems (avoid "premature optimisation"). Reducing the number of joins is a good way to get performance. But only once you are in that situation. And of course, there are other, simpler ways to improve performance that should be leveraged first (e.g. good old indexing).
So I'm glad that we agree that normalisation is a good thing and that we should only try to find alternative solutions if nothing else (e.g. proper indexing) works.

Markus also made an important comment about the column order in the WHERE clause that is not clarified in my original post:
  • The column order in indexes matters a lot 
  • The column order in the WHERE clause doesn't matter (rare exceptions exist, but generally, it doesn't!).

The original post starts here...

Last autumn I followed a course related to the performance of SQL. The course was given by Markus Winand. Although we don't agree on everything (for example, I don't like the "create as few tables as possible to minimise joins and achieve better performance" principle, because it goes against normalisation), Markus has great knowledge of general and RDBMS-specific performance issues.

I'm glad that I followed this course. Markus gave us a copy of his book which is very compact and to the point. This is an example of a book that I would never consider reading but it turns out to be a hidden gem. I recommend it to everyone working with relational databases.

It took me some time to write this post because I wanted to read the book first. In this book you will find things that you almost certainly don't know. For example, did you know that:
  • When building an index on more than one column (a concatenated index), the order of the columns matters a lot?
  • The statements in the WHERE part of a query affect whether a concatenated index is used or not?
  • LIKE expressions with leading wildcards (e.g. '%SQL') cannot make use of an index?
  • ORMs can cause big performance problems because of the bad queries that they generate?
  • Selecting only the necessary columns (avoid SELECT *) can improve the performance of joins?
  • A query whose columns (including those of the SELECT part) are all covered by an index does not need to access any data structure other than the index, which improves its performance enormously?
  • ORDER BY and GROUP BY can also be indexed?
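Several of these points are easy to verify yourself. The sketch below uses SQLite (via Python's standard sqlite3 module) because its EXPLAIN QUERY PLAN makes experimenting handy; note that the exact plan wording varies per SQLite version, and other RDBMSs differ in the details.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (last_name TEXT, first_name TEXT, salary INT)")
con.execute("CREATE INDEX idx_name ON emp (last_name, first_name)")

def plan(sql):
    """Return SQLite's query plan as one string."""
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

# Filter on the leading column of the concatenated index -> the index is used.
print(plan("SELECT salary FROM emp WHERE last_name = 'Doe'"))

# Filter only on the second column of the index -> full table scan.
print(plan("SELECT salary FROM emp WHERE first_name = 'John'"))

# Every referenced column is inside the index -> a covering index,
# so the table itself is never touched.
print(plan("SELECT first_name FROM emp WHERE last_name = 'Doe'"))
```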

The main message of the book is that indexes should be built by us, the developers, not by DBAs or anyone else. That's because only we know how the data are queried, and therefore only we can build the proper indexes.

Personally, I'm very sad to see how many features supported by other RDBMSs are not supported by MySQL. To mention a few: indexing functions and expressions, partial indexes, indexing with ASC and DESC, and window functions. Fortunately, MariaDB is getting there and I hope that we'll switch to it (at work) at some point.

Sunday, November 16, 2014

Course review: Language Engineering with MPS

Last week I followed a two-day course called "Language Engineering with MPS". The course was given by Markus Voelter.

MPS is a free software framework (using the Apache 2.0 license) built on top of IntelliJ IDEA. Both MPS and IntelliJ IDEA are actively developed by JetBrains. MPS can be used for implementing Domain-Specific Languages (DSLs), usually by extending a base language, which by default is Java. Extending Java is not a requirement. In fact, Markus is involved in the development of mbeddr, which uses a clean version of the C language as the base for targeting embedded system development.

According to Markus, text-based language development tools such as yacc, lex, Bison, ANTLR, and so forth are fading out because they lack the support of an intelligent IDE. Although I'm not fully convinced by this statement, I agree that IDE support when developing DSLs is a big plus. Do not overlook it: you get (for free) autocompletion, a nice user interface, very readable error messages, instant deployment and debugging, and much more.

During the course we covered only external (context-free) DSLs, because Markus considers internal (context-sensitive) DSLs hacky: they usually rely on the metaprogramming features of a specific language (Ruby, Lisp, etc.), which is most of the time either very limited or too complex (for example, you end up with unreadable error messages).

Markus has a good knowledge of language design. He gave us some good tips regarding DSL development, such as forbidding Turing-completeness in a DSL to make static analysis of a code block possible. Another tip was to support many keywords in the DSL (instead of having as few keywords as possible, which is considered good in general-purpose languages like C), giving the DSL user the chance to provide hints about the performance and behaviour of a code block. For example, provide two keywords for for loops: the default for is (or actually tries to be) concurrent, while the alternative forseq is always sequential.

Our main course activity was to use MPS for developing an Entities DSL. An Entity is an abstraction that can have a variable number of attributes with validated types. We created our own typing system for that (using Java's typing system as a basis), which supports strings and numbers. An Entity can also have references to other Entities. Finally, we can define functions inside an Entity using the fun keyword. Here's an example of an Entity:

Notice how we can create custom error messages informing DSL users when they try to do something erroneous, such as defining a variable with the same name twice. Another reported error (underlined in red in the picture) occurs when the user tries to return an incorrect type from a function, in this case a string from a function that should return an integer (notice the :number part).

From what I've seen in the course I feel that MPS is an interesting tool with the following pros and cons.

The pros:
  • Autocompletion.
  • Readable error messages. Even if a message is not very readable, you can jump to the source code immediately with a single click.
  • A nice user interface.
  • In general, it offers all the goodies of an IDE: integrated debugging, many ways of searching, refactoring, and so forth.

The cons:
  • The DSL user (domain expert) needs to install MPS to use our DSL. This usually requires some effort, because we need to create a customised (clean) version of MPS with all development features hidden/disabled, to avoid confusing the user.
  • Like all tools, MPS requires time and effort to feel confident with. In particular, typing in the MPS editor can be confusing and frustrating, because it is very different from the free-text typing that is the usual way of writing code.
  • Documentation. There is only one book explicitly targeting MPS so far.
  • Lag on Windows. The hired laptops that we used during the course were quite powerful, but MPS was still lagging on Windows. I have tested it on GNU/Linux without any issues (and neither did Markus on his MacBook), so it seems that MPS has performance issues specifically on Windows.