
I use Puppet Dashboard as my external node classifier.

Once you have a few hundred nodes, it can be very beneficial to use an ENC instead of maintaining node definitions directly in your puppet configuration. For one thing, it helps to visualize the status of nodes, and for another it is often useful to logically group nodes and visualize those groups. Most (although not all) of what an ENC can do is possible with flat files and clever includes or realize statements, but having a web UI like Dashboard is a huge win in legibility when managing groups of nodes.
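
For reference, an ENC is just an executable: the puppet master invokes it with a node's certname, and it prints YAML describing that node's classes, parameters, and (optionally) environment. A minimal sketch of the sort of output Dashboard hands back - the class and parameter names here are invented for illustration:

classes:
  - ntp
  - sshd
parameters:
  datacenter: nyc1
environment: production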

I would describe Dashboard as being "pretty good." It's not the only ENC out there - and some of the others are highly recommended - but I have gravitated towards it since it is maintained by Puppet Labs rather than a third party. It's certainly got some rough edges, and some missing functionality, but it gets the job done and it's OSS.

My biggest issues with Dashboard have generally been related to performance. Over time, the thing starts running like a dog on the VMs on which I typically run my puppet master.

You might notice this yourself: Dashboard runs fine at first, but over time it becomes less and less responsive. Puppet runs themselves seem OK, but actual page loads in the UI are terrible.

The biggest reason for this is the (nifty) reporting tables, which store data about puppet runs. Unfortunately, barring your intervention, these tables will grow indefinitely, and data about, say, year-old puppet runs is arguably not very valuable. If you read the actual docs, you'll note that Dashboard includes a rake task, which you should run as a cron job, to prune reports and keep the sizes of these tables sane:

rake RAILS_ENV=production reports:prune upto=7 unit=day

Obviously, you can keep more or less data depending on your hardware. Our hardware is virtual and scarce, and keeping things lean is crucial. We run at a rate of roughly 80,000 reports per week, and running this task nightly keeps us from ever greatly exceeding that.
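
For the record, that nightly run is an ordinary cron entry on the master. A sketch, assuming the RPM layout with Dashboard installed in /usr/share/puppet-dashboard and running as a puppet-dashboard user - adjust paths, user, and schedule for your install:

# /etc/cron.d/dashboard-prune
0 2 * * * puppet-dashboard cd /usr/share/puppet-dashboard && rake RAILS_ENV=production reports:prune upto=7 unit=day >> /var/log/dashboard-prune.log 2>&1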

Of course, there are some gotchas.

Until 2011, the rake task did not prune resource_statuses, and you had to deal with that manually (thankfully, this functionality is now included in the rake task). If you're running an older Puppet Dashboard and things are slowing to a crawl... time to upgrade, man! Failing that, you should at least prune the resource_statuses table with your own SQL.
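
A sketch of what that manual pruning might look like, assuming the schema of that era, where each resource_statuses row points back at a report via a report_id column - this deletes the statuses orphaned by the report pruning above:

mysql> DELETE FROM resource_statuses WHERE report_id NOT IN (SELECT id FROM reports);

Run it after reports:prune, so the statuses belonging to freshly-pruned reports go with them.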

If you're running on meager hardware, you should really be using Ruby Enterprise Edition (REE) to make Rails performance suck a little bit less. It's even possible to use the RPM version of Dashboard (and of puppet itself) along with REE with a little tinkering.

Even doing this stuff, though, we had some trouble. In the web UI, loading node pages started taking longer and longer. And I saw this in our slow.log:

# Query_time: 7.419537 Lock_time: 0.000077 Rows_sent: 1 Rows_examined: 4766330
SELECT count(*) AS count_all FROM `timeline_events` WHERE ((subject_id = 398 AND subject_type = 'Node') OR (secondary_subject_id = 398 AND secondary_subject_type = 'Node'));

Well now, that's an expensive little query. And it's run on every node page load.

Honestly, I didn't even know what timeline_events were, but that's a hell of a lot of rows to examine. One thing you could do is add indexes:

ALTER TABLE timeline_events ADD INDEX indexsubjectid (subject_id, subject_type);
ALTER TABLE timeline_events ADD INDEX indexsecondary (secondary_subject_id, secondary_subject_type);

Which helps, but... do we really need all these entries in there anyway?

It turns out this table is used to track when node objects in Dashboard change. This can happen when you actually edit a node in the web UI, which is good to know (Hey, why did this node suddenly crap itself? Oh, looks like somebody jacked with it in the ENC and should be reprimanded! Thanks timeline_events!). It also happens whenever a node creates a report... which is redundant, since reports have their own table, which (in my case) I'm keeping very lean.

So timeline_events gets a new entry every time puppet runs, on every puppet node. And this table doesn't actually have any data about what happened - it just says "hey, something was updated" - which is, um, not very interesting, when we can get the complete skinny from the report table. I'll leave it to you to decide how much value this data has to your organization, but I personally decided "not much" and did this:

mysql> DELETE FROM timeline_events WHERE created_at <= DATE_SUB(NOW(), INTERVAL 1 MONTH);

Boom. Fast again! Looks like we need another cron job...
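
Something like this, say - the database name is an assumption (check your database.yml), and in real life you'd keep the credentials out of the command line:

# /etc/cron.d/dashboard-timeline-prune; database name is a guess, adjust to match your install
30 2 * * * root mysql dashboard_production -e 'DELETE FROM timeline_events WHERE created_at <= DATE_SUB(NOW(), INTERVAL 1 MONTH);'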

Comments

puppetmaster @ Fri Jun 21 14:26:02 -0400 2013

Hey, great little doc.

I run puppet master 1.2.23, just upgraded from 1.2.13 (which was slow too). I was hoping the newer version would have had some type of performance enhancements. But no, the new version is painfully slow. Even a rake routine to add a node manually takes 5-6 seconds. I tried adding an index, and even tried removing the timeline_events records. But no go. I'm not using Ruby Enterprise, but my puppet software is from RPMs, on a CentOS 6 server with a fairly modern kernel. I have tried the usual rake routines to optimize and prune. No difference at all. The prune rake routine takes 5 minutes???? Thank god we don't run a bank on the dashboard :)

A page refresh for a listing of nodes takes roughly 10-11 seconds. Is that normal? Is it a ruby/rails thing or is it a puppet dashboard code thing? MySQL obviously can move the bits around pretty well. Is this type of response normal? Am I out of luck or is there something terribly wrong?

Thanks

sort of "puppetmaster".
