I spent the better part of the last 2 weeks tracking down a deadlock in our Database Engine. It’s a little disheartening spending so long on a problem and then having the fix be a simple ‘move this line of code 4 lines higher’ kind of thing. But on the good side I learned a LOT about how our dbengine code works and refreshed my memory on locks and monitors.
For more details follow on to my full post…
Over the years I’ve had to deal with locks and monitors and semaphores in various capacities but I always have to refresh my memory when I get something this complicated on my plate. This time around I found a few pages that really helped. The IBM DeveloperWorks website has some great tools and pages documenting how to use various constructs and language features. I found a 3 page series on pthreads that proved very helpful. Here are the links:
- http://www.ibm.com/developerworks/linux/library/l-posix1.html
- http://www.ibm.com/developerworks/linux/library/l-posix2/
- http://www.ibm.com/developerworks/linux/library/l-posix3/
Our specific problem was a race condition, which is what made it really difficult. For awhile I had to run our unit test suite in a loop from the commandline to reproduce it; like this:just went in, it was
// really? man this sucks. 10 minutes later a hang
for((i=0;i<10000;i++)) ; do ./songbird -p test -test ; done
Fairly quickly I isolated a hanging case in the localdatabase:dynamicplaylist test. I played with how many loops I had to do and could now run this command:
// better but not ideal. 9.5 minutes later a hang
for((i=0;i<1000;i++)) ; do ./songbird -p test -test localdatabase:dynamicplaylist ; done
The problem was that our main thread was executing a database query that was synchronous (most of our queries actually are) and would end up doing a Wait(timeout) on a monitor but the query would never complete so we would always re-enter the Wait() call. The worker thread for the database, that was referenced in the state for the query, was alive and in its queue processing loop waiting for something to get queued up. It was doing a Wait() on another monitor without a timeout.
As I type this with the perfect vision of hindsight it looks very obvious where the problem lies. But the hard part was getting enough data out of the run to build the picture I just described, and overcoming the assumptions I had about how things were working. Because I didn’t yet have a short test case I could run it was very painful gathering data. The race condition meant that stepping through in the debugger was out. Send in PR_LOGGING, Advanced printf() Debugging(tm)!
Then there was the sheer number of monitors and locks we were dealing with, and these are just the ones associated with this specific problem. The classes in question have at least double this number of locks for guarding specific pieces of state.
CDatabaseEngine
- m_pThreadMonitor
CDatabaseQuery
- m_pQueryRunningMonitor (waits on this for completion of sync queries)
- m_StateLock
QueryProcessorThread
- m_pQueueMonitor
- m_pQueueLock (yikes - 2 things guarding the queue!)
- notifies on m_pQueueMonitor when new items added to the queue
QueryProcessor Static method
- waits on pQuery->m_pQueryRunningMonitor and pThread->m_pQueueMonitor
- notifies on pQuery->m_pQueryRunningMonitor when complete
The guilty code happened to be in the QueryProcessor method. It was a weird condition of threads being switched out when they leave a lock and not wrapping enough code in a monitor. So we were getting the length of the queue in a lock, leaving the lock and the entering a monitor. But between leaving the lock and entering the monitor another thread was adding an item to the queue and notifying on the monitor. So by the time the QP code got into the monitor the notification had already fired and it thought there was nothing to do. But everything was guarded with locks and/or monitors so it seems reasonable to think that it was all protected — wrong!
I did finally get a unit test written for this, and it really helped in isolating the fix. It turns out that if you hammer the DB with a bunch of queries, long enough for the background thread to spin up, the sbLocalDatabasePropertyCache’s flushing that causes the hang every time.
// now we're talking, and only takes 15 seconds of running to hang
./songbird -p test -test dbengine:bug6514
The fix is in now, checkout bug 6514 if you want to see the final patch and a little more discussion.













1 Comment
SubscribeVery intriguing post. Thanks for sharing the info
Good for the technically-interested who don’t follow bugzilla, but still want to know something about the might background work you do!