The Linux Bloke

Who's the Biggest Geek on the Internet?

Browsing Posts in Ruby

I run a number of web sites with WordPress, and have recently upgraded many of them to WordPress 3.0.1. It seems that every time I upgrade WordPress, some of the many and varied plugins simply break.

Why is this? I mean, in this day and age of well-defined APIs, OOP, proper object factorization, unit testing, and the like, you’d think this would rarely happen, right?

But, alas, the real problem is that most WordPress plugins are simply horribly written. Period. Way too much dependence on global variables, no use of PHP’s OOP capabilities (even though WordPress definitely supports OOP), and just poor code organization all around.

So, whenever WordPress changes something, lots of plugins simply fail to work properly. It’s so prevalent it drives me mad.

Proper software design principles apply whether you are programming in C++, Ruby, Perl, Java, or PHP. We have those design principles in place for a reason. Popular applications are never static, and should never be expected to be. That’s why we do certain things in a certain way, folks!

Of all the languages I know, I consider PHP to be the absolute worst, basically the Basic of the 21st century. Just like Basic, PHP allows you to get away with many ills. It allows you to write very sloppy code very quickly, and actually get it to work “good enough” to throw into production!!!!

But when you are talking frameworks and plugins, there is no room for being sloppy. When you are talking millions of users that must rely on your plugins, there is simply no place for taking shortcuts. You do the job right, or not at all. You adhere to the well-established, sound design principles that we have worked out over the past 30 years or so, or you go back to “school” to get a clue.

But I really fault PHP for allowing such evils in the first place. Ruby and Python strongly encourage you to do the right thing when you write code. Ruby on Rails is sweet in this regard.

In my experience (and I have 30 years of it!), it takes just as long to write good code as it does to write bad code. With bad code, you spend much more time debugging it, and “fixing” the bugs probably entails writing more sloppy code to work around the existing sloppiness. So time-wise, you’re being penny wise and pound foolish.

And then comes maintainability. With poorly written code, forget it. The time it takes to maintain it grows exponentially as the underpinnings shift and evolve. With Open Source development, what usually happens over time is one of the following:

  1. The code is abandoned and everyone stops using it.
  2. The code is re-written from the ground up (and using proper design principles finally!!!!)
  3. The code has become critical to many applications, but no one truly wants to maintain it because it’s so horrible, so it “limps along” with the barest minimum effort applied just to keep it — somewhat — running.

All of which could’ve been avoided if the code had been written properly up front. That would free developers to work on more cool stuff, giving us even more functionality, and allow the underlying frameworks to grow and expand without worry of breaking all the plugins and themes out there.

So get a clue, you bad PHP code slingers out there! It’s not hard at all to write good code, and it is actually quite enjoyable. Spend less time playing video games and more time educating yourself. It’ll look good on your resume and improve your bottom line. And make those who use your code happier. Why? Because your code won’t call attention to itself by not working, and your name is far less likely to become an expletive.

It’s up to you. Only You can write Good Code. If not you, then who else?

Before we get started here, let me state that I am using Ruby 1.9.1 (I refuse to look back!), and that I have not tested this solution on Ruby 1.8.6, but it should work there as well, though I may have some 1.9-isms in my code. Should be easy enough to spot.

I am writing an application in Ruby that talks to a Windows application that exposes an ActiveX COM Automation object. Ruby is basically the wrapper so that I can access the application from the Linux side of the world. So, I am using Ruby’s DRb to bridge those worlds because, after all, I am the Linux Bloke!

Well, as you may have guessed, I ran into problems with this approach. I simply could not call the COM objects from a call initiated with DRb, though I could call them directly just fine. After scratching my head a bit, I figured it out.

The win32ole module that runs on the Windows side of the world in Ruby only wants to run in the same thread that it was started in. win32ole is simply not thread-safe, and this is due in large part to how ActiveX works under Windows. No need to delve into the gory details, as we want code that works already!

DRb is very much all about threads. The DRb Server runs in a separate thread, and threads are launched each time a DRb request comes in. Threads abound like crazy! After all, it is very clear that the implementation of DRb was based, in part, on the Java threading model and Java’s RMI. But we knew that. We know that Ruby Threads parrot Java Threads. And I’ve done a lot of work with Java Threads in the past and almost feel a bit of “déjà vu” in working with them in Ruby. Oh the days…

But I digress.

We have a major problem here. How do we get around it without having to throw out DRb and do something funky like writing some custom RPC bit just to make Windows happy?

Well, as you may have guessed, the Linux Bloke created the very solution you need!! Funnel!

Funnel works by wrapping a given object with a “meta” object that can then be called from any thread. All the calls are actually queued up and processed by the thread the target object wants to run in. The calling threads block until the target object returns the call, and the result objects are stuffed somewhere so that the calling thread can find them.

It’s all very transparent and you need not do anything special — much. You will need to call process_funnel_messages() in the funneled thread. And you may do this once in which case process_funnel_messages() will loop forever and never return, or you can call it at regular intervals if you need to do other processing in that same thread.

You, of course, can use Funnel anywhere you need to funnel calls from multiple threads to a single thread to access something that is not inherently thread-safe or thread-aware.
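The queue-and-block idea described above can be sketched in a few lines using Ruby's standard `Queue`. This is not the funnel.rb listing, only a minimal illustration of the same technique; the names `MiniFunnel`, `Recorder`, `call` and `process` are made up for this example.

```ruby
require 'thread'  # Queue (required on older Rubies, harmless on newer ones)

# Hypothetical minimal funnel: calls from any thread are queued and
# executed by whichever thread calls #process.
class MiniFunnel
  def initialize(target)
    @target = target
    @requests = Queue.new   # thread-safe FIFO of pending calls
  end

  # Called from any thread: enqueue the call, then block for the reply.
  def call(meth, *args)
    reply = Queue.new
    @requests << [meth, args, reply]
    status, value = reply.pop         # blocks until the owner thread answers
    raise value if status == :error   # re-raise errors in the caller thread
    value
  end

  # Called only from the thread that owns the target object.
  # Processes exactly `count` queued calls, then returns.
  def process(count)
    count.times do
      meth, args, reply = @requests.pop   # blocks until a call arrives
      begin
        reply << [:ok, @target.send(meth, *args)]
      rescue StandardError => e
        reply << [:error, e]
      end
    end
  end
end

# A stand-in for something that must only run on one thread.
class Recorder
  attr_reader :threads

  def initialize
    @threads = []
  end

  def touch(n)
    @threads << Thread.current   # remember which thread executed the call
    n * 2
  end
end

recorder = Recorder.new
funnel   = MiniFunnel.new(recorder)
owner    = Thread.current

workers = (1..3).map { |n| Thread.new { funnel.call(:touch, n) } }
funnel.process(3)                       # run the three queued calls here
results = workers.map(&:value).sort     # each worker got its own result back
```

Every `touch` call runs on the owner thread, even though it was requested from a worker thread. Using one reply queue per call is a design choice of this sketch: it avoids having to wake the caller by hand.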

The downloadable code is posted here:

funnel.rb:
=begin rdoc
Funnel created by Fred Mitchell (LinuxBloke.com) on 2010-06-05                         
 
=Funnel -- funnel calls to an object to a specific thread that created said object.    
 
With some systems, like win32ole, the system basically wants to run on the same thread
the system was started on. To facilitate that need in a multi-threaded environment,
we create the Funnel.                                                                  
 
The Funnel wrapper on an object will basically intercept all method calls and
funnel those calls to the wrapped object in the thread it was created in. The
caller thread will basically block until the Funnel calls the target object's method
and will be given, as a return, the result object of that call.                        
 
The Funnel thread will basically sit in a loop waiting for something to come in,
and wake up to process the entries, then go back to sleep until the next ones come
in.                                                                                    
 
Any exceptions (or errors) that occur in the Funnel shall be
thrown to the caller thread, as though the exception took place in that thread.        
 
This code is released under the GPLv3.                                                 
 
=end                         
 
module Funnel
  class Wrapper
    def initialize(target)
      @targetOb = target
      @targetThr = Thread.current
      @targetThr[:methQueue] = [] if @targetThr[:methQueue].nil?
    end                                                                                
 
    def method_missing(meth, *parms)
      Thread.current[:methResult] = :nothing_yet
      @targetThr[:methQueue] << [@targetOb, meth, Thread.current, parms]               
 
      # Thing is, we may have gotten a response already!
      while Thread.current[:methResult] == :nothing_yet
        if @targetThr.stop?
          @targetThr.wakeup
          # Thread.stop
        end
        Thread.pass
      end
      Thread.current[:methResult]
    end
  end                                                                                  
 
  # Called by the original thread to process object messages.
  # With loop_forever = true (the default) this function never returns;
  # pass false to process at most one queued call and return.
  def process_funnel_messages(loop_forever = true)
    begin
      meth = nil
      (ob, meth, thr, parms) = Thread.current[:methQueue].shift unless Thread.current[:methQueue].nil?
      unless meth.nil?
        begin
          thr[:methResult] = ob.send(meth, *parms)
          thr.run
        rescue
          thr.raise($!)
        end
      else
        Thread.stop if loop_forever
      end
    end while loop_forever
  end                                                                                  
 
  def wrap(target)
    Wrapper.new(target)
  end
end

And here is an example of its use:

require 'funnel'                                                                       
include Funnel                                                                         
 
class StupidThreadUnsafeThing                                                          
  def callme                                                                           
    puts "*** I've been called. My thread is"                                          
    p Thread.current                                                                   
    puts                                                                               
  end                                                                                  
end                                                                                    
 
stut = StupidThreadUnsafeThing.new                                                     
 
# This is the easy to use wrapper                                                      
fstut = wrap stut                                                                      
 
stut.callme                                                                            
 
Thread.new do                                                                          
  10.times do |i|                                                                      
    sleep 1                                                                            
    Thread.new {                                                                       
      puts "XXX #{i} calling stut from thread"                                         
      p Thread.current                                                                 
      fstut.callme                                                                     
    }                                                                                  
  end                                                                                  
  exit                                                                                 
end                                                                                    
 
# Here we loop forever processing messages.
# Optionally, we could call this repeatedly
# to process messages by using a parameter of
# "false".
process_funnel_messages

This code is fairly straightforward, as you can see. If there is enough interest, I’ll consider turning this into a gem.

It is no joke that computer hardware has advanced by leaps and bounds over the past decade. 10 years ago, multicore systems were expensive and high-end; today, your grandmother may have one and probably has no clue what she has!

Alas, application software has not kept pace. The Linux OS has done a fair job at being able to leverage some of the power multicore systems offer us, but applications running on them have not. The same can be said more or less for Windows, but it’s been a long while since I did anything systems-level with Windows. But the same issues do apply, however.

With multicore, we are today where we were in the ’80s and ’90s with multithreading. Back then, CPUs grew support for multithreaded programming, but software (including some OSes) was slow to adopt it. The Macintosh, when it was first released in 1984, would only support “cooperative” multitasking even though the underlying 68000 was perfect for preemption, which was leveraged by the Amiga that came out a year later in 1985.

Even I wrote a preemptive multitasking OS back in 1980, when I was 18 years old! I had a hard time understanding why Apple and Microsoft were having such a hard time with this 5 years later. The first preemptive OS released by Microsoft was Windows NT, which came out in 1993, shortly before Commodore closed its doors.

To this day I am still mystified as to why big corporations like Apple and Microsoft couldn’t pull off until much later what an 18-year-old such as myself could do in a couple of months!

Today, we are in a similar situation with applications, especially those applications that should be able to leverage multicore power such as Databases and Games and Web Servers. Actually, Web Servers such as Apache don’t fare too badly in this area from what I can tell. Can’t speak to IIS, since I’ve never used it. The MySQL Database struggles to leverage multicore, and the developers there seem to take only incremental steps. For a while MySQL’s performance used to actually degrade on multicore platforms! Now it’s been optimized for at least 8 cores, but you still may not see the expected gain on 16 or 32-core systems.

Why would this be the case? Surely, MySQL and other applications are written to be *multithreaded*. Ah, there lies the rub. Without getting deep into the details of system resource allocation, spin-locking, and the like, I wish to discuss this issue from a “high-level” perspective. I’m sure some of you not-so-tech-savvy tech managers would appreciate that!

And, by the way, there’s nothing wrong with not understanding all the nitty-gritty details of semaphores and synchronization issues if you are a tech manager. At least, as long as your people do. But then, that’s part of the problem. Some do; many don’t. And those that don’t may not be as forthcoming as you’d like for fear of being fired or ridiculed or being given a lower status, etc. That’s the way it is.

From a bird’s-eye view, it’s not too hard to understand at all. The basic difference between multithreaded programming and multicore programming is this:

If you are doing multithreaded programming for single-core systems, you gain nothing by trying to do a lot of processing simultaneously in multiple threads. The goal of multithreaded programming is, instead, to keep the idle time of the CPU as low as possible whilst doing the most work possible, where the work will be done in serial fashion anyway. If you actually do attempt to do real processing in multiple threads, in most cases your performance will actually be worse than if you did it using serial programming. Why? Because the processor takes time to context-switch between tasks, and the more it has to do that, the more overhead you’ll incur.

On a multicore system, on the other hand, your goals are quite different: you actually do want to spread the work out among the cores so that it DOES execute simultaneously, because your performance gain should be directly related to how many cores you have. So, a quad-core system should be able to get the work done 4 times faster than a single-core system. A 16-core platform should be 4 times faster than a quad-core system, and 16 times faster than a single core.

Ah, but as always, there’s a catch. Hence my use of the words, “should be”, rather than “will be”.

Making efficient use of multiple cores is highly non-trivial. For starters, you may not be able to break the tasks down into a parallelizable form. Or if you can, there may still be dependencies between the tasks where one would have to wait on another for information, or wait for a common resource (such as the hard drives or network cards) to become available.
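One standard way to put numbers on that “should be” is Amdahl’s law, which this post does not name but which models the same effect: if only a fraction p of the work can run in parallel, the speedup on n cores is 1 / ((1 - p) + p / n). A small sketch, with the 90% parallel fraction chosen purely for illustration:

```ruby
# Amdahl's law: theoretical speedup on `cores` cores when only
# `parallel_fraction` of the work can be spread across them.
def amdahl_speedup(parallel_fraction, cores)
  1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
end

# Assumed workload: 90% parallelizable, 10% strictly serial.
[1, 4, 16, 32].each do |cores|
  printf("%2d cores: %.2fx\n", cores, amdahl_speedup(0.9, cores))
end
```

Under that assumption, 16 cores yield roughly a 6.4x speedup rather than 16x, and no number of cores can push it past 10x, because the serial 10% always remains.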

In a dynamic situation, such as a database server, the issues can become even more convoluted as you deal with the order in which many resources (such as rows in a table) are locked by many tasks running on many cores.

If you are having to deal with legacy code not designed for multicore systems, as was the case with the MySQL codebase, the issues become even hairier.

Also, most languages in popular use, such as C++, Java, Python and Ruby, have little to no facilities for multicore or distributed programming. Interpreted scripting languages like Python don’t even handle multithreading very well, at least Python pre 3.0. Ruby has issues in this regard as well.

The common wisdom with some is to run your program in multiple processes, which does work for those situations that don’t require a lot of state dependencies or resource sharing. That approach, when it works, is nice, because it scales well with the number of cores you have, and it also scales well in a cluster/cloud computing scenario.
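As a rough sketch of the multi-process approach, here is a job with no shared state (a sum of squares) split across forked workers, each reporting its partial result over a pipe. The function name `parallel_sum_of_squares` is made up for this example, and `fork` is only available on Unix-like systems such as Linux.

```ruby
# Split a sum of squares across `workers` child processes.
# Each child computes one slice and writes its partial sum to a pipe.
def parallel_sum_of_squares(range, workers)
  items = range.to_a
  slice_size = (items.length / workers.to_f).ceil

  readers = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    fork do
      reader.close
      writer.puts slice.inject(0) { |sum, n| sum + n * n }
      writer.close
      exit!(0)                 # leave the child without running at_exit hooks
    end
    writer.close               # parent keeps only the read end
    reader
  end

  total = readers.inject(0) { |sum, r| sum + r.read.to_i }
  readers.each(&:close)
  Process.waitall              # reap the children
  total
end

total = parallel_sum_of_squares(1..1000, 4)
puts total
```

This works well precisely because the slices are independent: the only communication is one number per worker at the very end.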

But if those simultaneously running systems DO require a lot of sharing of data, resources, and other dependencies, the scale factor is severely restricted. It may call for a complete rework of the algorithms involved or some clever system hacks, or both.

I don’t think there is any common wisdom that can be applied in all cases. And when you are dealing with time-to-market constraints, budgetary realities, and the like, you may be forced to take a sub-optimal path.

So it may be safe to say that it may be a little while before we see software truly leverage the true power of multicore systems. We will see it here and there where the effort can be applied and the understanding is present, but the rest will entail a slow evolutionary process.

MIT offers a course on multicore programming. Can’t say how good or bad it is, but it’s MIT. How can you go wrong? :-)

Let’s say you need to do a website that must support multiple languages for cultures as diverse as Japan, France, Russia, Saudi Arabia, and Brazil, as well as the US. This can be quite a daunting task, with all kinds of unexpected gotchas.

The ideal character set of choice is, of course, UTF8. Alas, you will note that most of the systems you’ll need to use default to LATIN1, including MySQL. If your site is written in PHP, that is also set to LATIN1 by default.

I find it quite puzzling that in this day and age of globalization that many of the tools don’t default to UTF8. And there are major issues with this, because everything in the chain of delivery must either be set to UTF8 or can handle UTF8 or you’ll see bizarreness when you attempt to display the characters of some languages. You will probably see a series of question marks (“??? ??? ?????”) instead of the actual words. Sometimes you may see a series of squares. Or maybe it looks like total garbage.
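Both symptoms are easy to reproduce in Ruby 1.9 or later, which tracks the encoding of every string. The sketch below uses arbitrary sample text: it misreads UTF8 bytes as LATIN1 to produce garbage, and converts characters that LATIN1 cannot represent to produce question marks.

```ruby
utf8 = "caf\u00e9"   # "café", stored as UTF-8 bytes

# UTF-8 bytes misread as Latin-1 and then re-encoded: the classic garbage.
garbled = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# Characters with no Latin-1 equivalent get replaced during conversion.
japanese = "\u65e5\u672c\u8a9e"
question_marks = japanese.encode("ISO-8859-1", :undef => :replace)

puts garbled          # cafÃ©
puts question_marks   # ???
```

The first case is what happens when one link in the chain assumes LATIN1 for data that is really UTF8; the second is what happens when a link is forced to squeeze UTF8 text into LATIN1.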

To debug charset issues, you must be certain that everything in the delivery chain is set for UTF8. I can’t stress this enough.

For example, on one project, the MySQL database was properly set to UTF8, but we kept seeing LATIN1 creep in from somewhere. The site was driven by PHP, and we made sure PHP was set to UTF8, but there were still issues. It turned out that PDO/mysqli was still defaulting to LATIN1, which was revealed by looking at the results of the following query issued through PHP:

SHOW VARIABLES LIKE "character%";

Which should result in:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

But instead we saw:

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

Clearly demonstrating that there was a connection issue. However, we were able to, as a quick fix, issue the following query on that connection:

SET NAMES 'utf8';

Which fixed the problem, though requiring us to run that query on every new connection. I am sure there is a better approach, but we didn’t have time to find it.
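A check like this is easy to automate. The sketch below assumes you have already fetched the output of the SHOW VARIABLES query into a hash of variable name to value; how you do that depends on your driver and is not shown. The helper name `charset_mismatches` is made up for this example.

```ruby
# Connection-level variables that must all agree for text to round-trip.
CONNECTION_VARS = %w[
  character_set_client
  character_set_connection
  character_set_results
]

# rows: hash of variable name => value, as reported by
# SHOW VARIABLES LIKE "character%". Returns the offending variable names.
def charset_mismatches(rows, expected = "utf8")
  CONNECTION_VARS.reject { |name| rows[name] == expected }
end

# The values we actually saw on the misbehaving connection.
observed = {
  "character_set_client"     => "latin1",
  "character_set_connection" => "latin1",
  "character_set_database"   => "utf8",
  "character_set_results"    => "latin1",
  "character_set_server"     => "utf8",
}

bad = charset_mismatches(observed)
puts "Connection is not UTF8: #{bad.join(', ')}" unless bad.empty?
```

Running a check like this right after connecting would have pointed straight at the connection settings rather than at the server or at PHP.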

But to give you an idea, this is the chain we had to check for UTF8:

  • MySQL Server
  • MySQL Driver/PDO Wrapper
  • PHP
  • Browser

If you are interacting with MySQL through the command-line client, then make sure you launch it thusly:

mysql --default-character-set=utf8

Or have the appropriate settings in the [client] section of my.cnf.

The character set headaches are not just limited to MySQL, but any interacting systems, web services, etc. Carefully checking the chain to ensure that every part of that chain defaults to UTF8 is essential to saving the day for the world of localized globalization!

On a large data migration project that I am currently spearheading, we have a large installed userbase of over 2 million users running on a social networking engine. The schema has been redesigned from scratch, and code is being written to match the new schema, using the all-powerful MySQL database as the system to manage all that data.

Since this social network is global, we need good and reliable location information. The current location model is flawed and full of holes, so we have chosen AssemblySys’ data to replace it.

We are not using AssemblySys’ schema, as we’ve rolled our own. I’ve designed our new schema to be hierarchical in nature, treating all locations on the planet as ‘nodes’ with a tree relationship, with “Earth” being the parent of all nodes. This model allows us to account for all countries and the idiosyncratic ways they divvy up their administrative divisions, which to say the least vary a lot.
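To give a feel for the tree model, here is a minimal in-memory sketch. It is not our actual schema (which lives in MySQL); the class name `LocationNode` and the `kind` labels are assumptions made for this illustration.

```ruby
# Hypothetical node in a location tree rooted at "Earth".
class LocationNode
  attr_reader :name, :kind, :parent, :children

  def initialize(name, kind, parent = nil)
    @name, @kind, @parent = name, kind, parent
    @children = []
    parent.children << self if parent
  end

  # Names from the root down to this node.
  def path
    node, names = self, []
    while node
      names.unshift(node.name)
      node = node.parent
    end
    names
  end
end

earth    = LocationNode.new("Earth", :planet)
usa      = LocationNode.new("United States", :country, earth)
kentucky = LocationNode.new("Kentucky", :state, usa)
boston   = LocationNode.new("Boston", :town, kentucky)
france   = LocationNode.new("France", :country, earth)
paris    = LocationNode.new("Paris", :city, france)

puts boston.path.join(" > ")   # Earth > United States > Kentucky > Boston
puts paris.path.join(" > ")    # Earth > France > Paris
```

Because every node simply points at its parent, a country with three levels of administrative divisions and one with five fit the same structure without any special cases.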

Currently AssemblySys does not have strong support for postal codes, and only about 5 countries use postal codes anyway. However, I was able to secure zip codes from a different vendor and graft them into our location model.

The AssemblySys location database is quite thorough and complete, with accurate geodata for the cities. In fact, it is so complete it even lists some towns that don’t show up on Google Maps! I verified that some of these obscurities I found do, in fact, exist.

And I uncovered a good bit of curious geographical trivia, like the fact that there are 5 towns in Kentucky called “Boston”. Must be a nightmare for the Post Office there! I also found there is a town called “Philadelphia” in South Africa! At first, I thought these must be errors, but I verified that these obscure towns do indeed exist.

Next came the task of transforming their location data to our model. This is where I had the most problems, because their data is not arranged in the nice, clean, hierarchical fashion our model is. In fact, it’s laid out in a very cumbersome fashion requiring a number of sub-keys to cull out the proper hierarchy.

To their credit, though, AssemblySys was quick to respond to my questions about how to access their data and shot back examples that were very helpful with the effort. But I felt their model was far more complicated than it needed to be, and perhaps could have used a bit more normalization. But I was able to do the transform after a few days of wrestling with it.

Overall, I am pleased with the quality of the AssemblySys product. I am not happy with their schema layout and the rather obtuse and complicated queries to cull out the structure. However, perhaps most users will use their database as is and perhaps it works better in that context, though the queries can get quite cumbersome from my estimation. The service is good, though completely email-based. The price is reasonable and the data is accurate.

If you deal with databases for a living, eventually you’ll come across cases where you’ll need to migrate a lot of data from one schema to another. I am not just talking about migrating from one type of database to another, like from Oracle to MySQL, but from, for instance, a badly-designed schema to one more expertly crafted.

If there are minor differences between the source and target schema, this is a trivial affair. On the other hand, if the schema is completely different, this can be quite a challenge. Moreover, the database being migrated might back a high-demand website, so the migration will need to be done with little or no downtime, with lots of planning and preparation to boot. You may be interacting with the application developers, the systems crew, and juggling tight deadlines as well.

Well, as you may have guessed, I have described some of the roles I now play at a leading social networking company. We are indeed in the midst of creating the “NextGen” product, a complete rewrite and redesign. The new system is designed with modularity and scalability in mind. The old system we are transitioning from was created when the company was much smaller and had demand at least 2 orders of magnitude lower. Suffice it to say, it has all the appearances of being crafted by a bunch of “juniors” that just quickly browsed through “PHP for Dummies”, “Database Design for Idiots”, and the like the night prior. That the aging application still works at all is seen as the “8th Wonder of the World”, but to its credit it brings in millions in revenue despite all of its faults.

I am an “old veteran” when it comes to software development. In my “advanced age”, I’ve decided to do databases, something that I’ve not done before in my 30-year career as a software developer. The nice thing is that I find much I’ve learned about algorithms and data structures can also be applied to schema design. It also helps with interacting with the applications development team, as I can relate to what their needs are and “bridge the gap”, as it were, between the code and the database.

I have chosen Ruby out of all the languages I know (Python, Perl, PHP, C++, Java, etc.) because of its expressive power and meta-programming capabilities, which most of the other languages either don’t do well or lack a clean syntax for.

First, let me speak of my general approach to data migration. You have your source and destination databases. Of your source databases, you will obviously have the main database containing the enterprise’s lifeblood information. Some of that data will relate directly to customer/account activity; some may relate to configuration of how that data is handled; other data may serve as a reference, such as a zip-code database.

Similarly, you will also have target databases, with the same type of data, but organized differently, and hopefully more efficiently. Also, what may have been denormalized in the source database you might choose to normalize in the target, or vice-versa. Perhaps passwords for user accounts were in plaintext in the source and now you need to md5 them in the target. Perhaps there were a fixed number of columns in the source tables representing some resource that you wish to store as separate rows in the target for added flexibility and expandability. Again, if you are only dealing with a couple of tables, it’s trivial to do the migration. If, on the other hand, you are dealing with dozens of tables, the problem explodes in complexity.

Since I want to illustrate doing a migration, I don’t want to bog you down with a complex schema; instead, I will take a simple example. Suppose you have a picture display site where each picture was represented by a column in the users table, and you need to migrate this to a more flexible system that will allow any number of pictures per user. If you have 10 million users in this table, doing an ALTER TABLE every time you needed to expand the number of pictures would be just plain silly.

CREATE TABLE old_accounts (
  id INT auto_increment primary key,
  name varchar(100) not null,
  email varchar(100),
  picture1 varchar(100),
  picture2  varchar(100),
  picture3  varchar(100),
  picture4  varchar(100),
  picture5  varchar(100)
) ENGINE=MyISAM;

And here is the new schema we wish to migrate this to:

CREATE TABLE new_account(
  userID INT auto_increment primary key,
  given varchar(50) not null,
  sur varchar(50) not null,
  email varchar(100)
  ) ENGINE=InnoDB;
 
CREATE TABLE pictures (
  pictureID int not null auto_increment primary key,
  userID INT not null,
  url varchar(100) not null,
  unique index(userID, url)
) ENGINE = InnoDB;

I have deliberately left out the foreign key specifications for clarity — and some would argue it would be a nasty performance hit under some circumstances, though I’ve not run into that problem personally.
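Stripped of any framework, the column-to-row transform for a single account can be sketched in plain Ruby as below. The function name `migrate_account` is made up for this illustration, and splitting `name` on the first space to fill `given` and `sur` is an assumption, since the schemas alone do not say how names should be divided.

```ruby
PICTURE_COLUMNS = [:picture1, :picture2, :picture3, :picture4, :picture5]

# Turn one old_accounts row (a hash) into one new_account row
# plus zero or more pictures rows.
def migrate_account(old_row)
  # Assumption: everything before the first space is the given name.
  given, sur = old_row[:name].to_s.split(" ", 2)

  account = {
    :userID => old_row[:id],
    :given  => given.to_s,
    :sur    => sur.to_s,
    :email  => old_row[:email],
  }

  # One pictures row per non-empty picture column; duplicates are dropped
  # because of the unique index on (userID, url).
  pictures = PICTURE_COLUMNS.map { |col| old_row[col] }
                            .reject { |url| url.nil? || url.empty? }
                            .uniq
                            .map { |url| { :userID => old_row[:id], :url => url } }

  [account, pictures]
end

old_row = {
  :id => 42, :name => "Ada Lovelace", :email => "ada@example.com",
  :picture1 => "a.jpg", :picture2 => nil, :picture3 => "b.jpg",
  :picture4 => "a.jpg", :picture5 => "",
}

account, pictures = migrate_account(old_row)
p account
p pictures
```

Note that `pictureID` is left for the database to assign via auto_increment, so the picture rows carry only the user and the URL.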

I have written a complete Ruby framework specifically for migration, but as of the time of this writing, that code is proprietary and not yet released to open-source, though eventually I may do that if I get clearance. But basically, I use Ruby classes to represent a “unit” of migration — normally a single source table to one or more target tables. So, using my Migration framework, here’s what this migration would look like in Ruby:

class UserMigration < Migration
    def migrate_map
        @src_table = {
             :old_accounts => {:PK => :id}
        }
        @dest_table = {
            :new_account => {
                :PK => :userID,
                :id => :userID,
                :name => :given,
                :email => :email,
            },
 
            :pictures => {
                :PK => :pictureID,
                :FK => {:new_account => {:userID => :userID}},
                :picture1 => :url,
            },
 
            :pictures => {
                :PK => :pictureID,
                :FK => {:new_account => {:userID => :userID}},
                :picture2 => :url,
            }, ... 
        }
    end
end

Well, that’s it — almost, and there’s a problem in the Ruby code that you will catch right off the bat if you know Ruby — and I think that if you look at it for a bit, you can figure out what’s going on here. So I’ll leave that as an exercise for you to mull over. You don’t really need to know Ruby at all to understand what’s going on here, and that’s the bit I like about Ruby. You can use it as a type of “meta-language” if you know what you’re doing.