
week=8
This week is beyond hectic. I am beginning to realize just exactly how much I need to do this semester and how many things I need to prepare before school even starts in two weeks. On top of that, I need to wrap things up with this project. There are a lot of things that will be left undone and unexplored simply because there isn't the time. And on top of that still, I have a million things to do to get ready to move out.

And then, on top of all of that, I'd like to spend some of the time that I have left here with all of the fantastic friends I've made. It truly has been a wonderful experience, with all of the things I've learned and people I've met. I am really disappointed that I have to cut my time here so short.

But pressing on. Onwards and upwards I hope. I can already tell next year will be intense. But I would like to look into building off of the work I've done here, so I will be sure to post updates.
week=8
Not surprisingly, this week was not so productive in terms of this research given that I nearly killed myself on it last week, and also because with working on all of that so much, I have neglected a lot of other things that I have to get ready before school starts. A lot of this week was spent on organizing funding and events for the ACM-W and looking into what I am going to need to do to get ready to start applying to graduate schools and fellowships.
week=8
I just barely survived this week, running on gallons of Red Bull and about three hours of sleep per night. It was absolutely necessary, but it paid off, as I managed to finish implementing the simple policies, get some interesting, if limited, results, and make my deadline with over an hour to spare. In the end, I think it turned out well. You can see the material we submitted here. Here is the abstract we submitted:

Data-intensive distributed science applications depend on efficient access to data sets and to high performance computational resources. Data sets generated on an experimental apparatus or on computational resources must be distributed widely to scientists in the collaboration, often according to policies set by the collaborating institutions. These policies pertain to the dissemination, security, and reliability of data sets.

In this work, we integrate an open source rule engine with existing services for grid data management to perform policy-driven data distribution. We implement and evaluate two realistic distribution policies for distributed science applications. The first policy specifies a tier-based pattern to distribute published data products in a manner similar to that used in high energy physics applications. The second policy maintains a specified number of replicas for each file. Our initial results indicate that a rule engine is well-suited to the problem of policy-based data management for distributed science applications.

Now I still have plenty to do but it feels like such a relief to make it past that deadline. The week was very long. I rejoiced mightily on Thursday.
week=8
This week I spent still trying to get things implemented. I seem to have severely underestimated the time it would take me to do this. While conceptually my task is simple, there always seem to be a lot of little unforeseen glitches that end up wasting several hours and setting me back. For example, I spent some time trying to figure out how to properly authenticate my machine on the cluster I want to use, and another bit of time figuring out that Eclipse was for some reason caching old copies of rule files instead of getting the new ones, and more little irritating things like that.

I am exhausted from having a very fun weekend with my friend and from feeling bogged down by these little problems. I was supposed to have all of my simple policies implemented at this point, which I thought would be easy, but they still aren't ready. I was thinking I'd even be able to try some harder policies, but that's clearly out of the question with my deadline coming up next week and my simple policies still unimplemented. Next Thursday is my deadline. Next week will not be fun.
week=8
My best friend from back home came to visit me on Friday, so I only had a four-day work week. I am beginning to worry about making the deadline for the student poster session at SC08. The deadline is July 31st (in two weeks).

I spent the week trying to get some things implemented that will give us some results for a possible paper. We've made a rough outline of what a paper might look like and identified some simple goals to serve as a guide in terms of how far along the implementation of various policies needs to be at this point.

In addition to the policies we'd like to implement, we also plan on running some tests to get a rough idea of the scalability and performance of the Drools rule engine.

To give you some background, a rule engine is a program that is used to encode certain knowledge in the form of rules. These rules generally take the form of "If A is true, then do B." The engine uses these rules to make decisions given a certain set of user-provided facts that represent information about a scenario to which the rules may be applied. One typical example that is often provided of how a rule engine might be used is as a medical diagnostic tool. In this case, doctors would encode their knowledge as rules and provide these to the rule engine. These are then used in conjunction with information about a patient's symptoms to aid in making a diagnosis. For example, there may be a rule that says, "If the patient has a sore, irritated throat and a fever, then give them a strep test." And another rule might say, "If the patient's strep test is positive, then diagnose them with strep and provide them such-and-such medication."
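
To make the "If A is true, then do B" idea a bit more concrete, here is a minimal sketch in Python. It is not Drools and does not use any Drools API; the `Rule` class, the `run_rules` function, and the fact strings are all names made up for this illustration. Facts are just strings in a set, and each rule is a condition paired with an action.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Rule:
    """A hypothetical rule: if condition(facts) holds, run action(facts)."""
    name: str
    condition: Callable[[Set[str]], bool]
    action: Callable[[Set[str]], None]


def run_rules(rules: List[Rule], facts: Set[str]) -> List[str]:
    """Fire each rule at most once, looping until no further rule can fire."""
    fired: List[str] = []
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.name not in fired and rule.condition(facts):
                rule.action(facts)  # an action may add new facts
                fired.append(rule.name)
                changed = True
    return fired


# The two rules from the medical example above (illustrative only).
rules = [
    Rule("order-strep-test",
         lambda f: {"sore throat", "fever"} <= f,
         lambda f: f.add("strep test ordered")),
    Rule("diagnose-strep",
         lambda f: "strep test positive" in f,
         lambda f: f.add("diagnosis: strep")),
]

facts = {"sore throat", "fever"}
fired = run_rules(rules, facts)
print(fired, sorted(facts))
```

With only the symptoms as facts, just the first rule fires. If a "strep test positive" fact is added later and the rules are run again, the second rule fires as well, which is the sort of chaining a real rule engine handles for you.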

Our simple performance test involved creating 1000 rules that take the form, "If a number has a value equal to x, then increment the count of matched numbers," where x is some nonnegative integer value. We then generate an increasingly large number of facts that each represent some number with a randomly chosen value. The purpose of this test is just to get a rough idea of the scalability of the rules engine in terms of numbers of facts, but obviously this is just one factor to look at. Others that are likely more important include the number of rules and the complexity of these rules. There are benchmarks that exist specifically to test rule engines, and one thing I'd like to do is test out Drools on those. Surprisingly, despite how simple our tests seem, it took a bit longer than you would think to settle on the idea for them. I worried about it seeming a bit arbitrary and irrelevant, but time spent worrying about that is time that needs to be spent doing something else. So, time is of the essence, and I don't know how valid our tests are, but they are good enough for now.
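
The shape of that test can be sketched roughly as follows. This is a plain Python stand-in, not the actual Drools test: the function names, the range of random values, and the fact counts are assumptions chosen for illustration, and the matching here is a naive loop over every rule for every fact, which is not how a real rule engine matches facts. Any timings it prints say nothing about Drools itself.

```python
import random
import time

NUM_RULES = 1000  # one rule per value x in 0..999, as in the test described above


def make_rule(x):
    """Hypothetical rule: 'if a number has a value equal to x' (the condition only)."""
    return lambda value: value == x


def count_matches(rules, facts):
    """Naively check every fact against every rule and count the matches."""
    matched = 0
    for value in facts:
        for rule in rules:
            if rule(value):
                matched += 1  # 'then increment the count of matched numbers'
    return matched


def timed_run(num_facts, seed=0):
    """Generate num_facts random numbers and time the matching pass."""
    rng = random.Random(seed)
    rules = [make_rule(x) for x in range(NUM_RULES)]
    # Assumption: values are drawn from a range twice the number of rules,
    # so that some facts match no rule at all.
    facts = [rng.randrange(2 * NUM_RULES) for _ in range(num_facts)]
    start = time.perf_counter()
    matched = count_matches(rules, facts)
    return matched, time.perf_counter() - start


# Increasingly large numbers of facts (sizes are illustrative).
for num_facts in (100, 500, 1000):
    matched, elapsed = timed_run(num_facts)
    print(f"{num_facts} facts: {matched} matched in {elapsed:.3f} s")
```

Varying `NUM_RULES` or making the conditions more complicated would be the obvious way to look at the other factors mentioned above.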
week=7
I'm not really even sure what to say that I did this week. It was just one of those weeks where you feel like you're running in place. I hate that. But I guess the real achievement was working out many of those details I was worried about last week and coming up with a game plan, i.e. specific tests that I will try to run and specific things I will write about in my paper. I owe Dr. Chervenak for helping me to get back on track.

One activity that consumed a lot of my time this week was playing with my new laptop. I got a MacBook. It is a lot of fun to play with and now that I think about it, that is probably where many of my hours went. I wanted to get things set up how I want them, though, and there are still more things I'd like to do with it. But at least I have it to the point where I can start getting back to work on my actual work.

I also spent a bit of time trying to persuade the Culver City Police Department to send me a copy of the police report, but it always seems that the person I supposedly need to talk to, which changes from day to day, just happens to be out of the office.

I intended to make these journal entries solely about the research I am doing, so I apologize for my digressions. I justify it by saying this is one more way in which crime hurts society: it hinders the progress of research! Yes, because my work is of vital importance... In all seriousness though, I am a bit miffed by the fact that not only was my laptop stolen from me, but so was a vast amount of my time, from watching hours of security footage, to telling my story over and over again to various people, to playing phone tag with the police department, to getting sucked into the marketing vortex that is Apple in choosing my new laptop, etc. I am having a hard enough time trying to be productive every day without a laptop thief stealing away my best hours. And insurance doesn't cover that kind of theft (although it does cover my laptop, which is nice).

I did get a free iTouch. That helps take the sting out a bit.
week=6
This week was short because the 4th of July fell on a Friday this year, so we had a three-and-a-half day weekend which started on Thursday, and as a consequence I don't feel that I accomplished very much. Oh, also our apartment was broken into and my laptop was stolen this weekend, so that has slowed the speed of progress somewhat.

I basically spent the week trying to think of other possible policies to implement. Initially, I was having trouble imagining what kind of policies a VO might have pertaining to the management of their data, which I think I can attribute to a lack of hands-on experience with the ways in which scientific organizations that form a VO use a Grid. Luckily, inspiration came to me, and I have come up with a list of about a dozen categories of policies that a VO might need or wish to enforce, although a few of these are not strictly related to data management. This inspiration came as I was taking a mental break from work and looking up information about various professors at UVA to try and figure out who to ask to be my senior thesis advisor this year. Much to my surprise and joy, I discovered that Dr. Marty Humphrey is currently working on issues related to policy implementation in Grid computing, like me! So I think he will be the first person I try to persuade to be my thesis advisor (but shh, don't tell, because I haven't done it yet). Reading a few of his papers introduced me to a different way of approaching and defining the problem, as well as ideas for alternate implementations. This helped open my mind up to new avenues of thought, as I'd kind of been feeling as if I had hit a mental barrier before.

Now my problem is that I've become a bit confused since I've seen that a great many different formulations of the problem and implementations of the solution are possible. My next step, I suppose, is to try and unconfuse myself by narrowing down the types of policies to enforce and features to provide. I am trying to plan how to make my application flexible enough to handle a broader range of policies, which basically means determining all possible information my application may need and how it will obtain that, and how to formulate rules that will handle this wider range of input and cover the larger number of possible cases. The addition of functionality could change the way that we've been imagining this application should work, and that is something that needs much more thought. I've come to think that it is one thing to have a vague, conceptual (or "high level") idea of what this application should do, and an altogether different thing to work out the details of how to do it.

So, this next week, I need to specify the devilish details of what needs to be done and then figure out how to do it. One way that I've been making things more difficult for myself is by trying to think of ways that we will "test" this application in order to have some results to write about in a paper. I think this will end up boiling down to coming up with many different possible use cases for my application, as was suggested to me by Rob Schuler (a developer at the ISI), which is just another way of saying that I need to figure out the details of what my application should do, as I've already observed above. Trying to think of ways to "test" my application, as well as a lot of use cases, is starting to make me feel as if I'm hitting another mental block, so that really must be worked out this upcoming week.
week=5
This week I finished up deploying a few more services from the Globus Toolkit on my local machine, which turned out to be unnecessary, as far as I can tell at this point in time. So, then I started work on my application. It successfully queries the Replica Location Service (RLS) and executes a simple n-copy replication rule based on some very simple test data that I feed it, but the API I was trying to use to perform the actual transfer of files won't work for me consistently, for some unknown reason (it could be a bug or else I'm doing it wrong...). So, that part is not quite worked out yet, but it should be relatively straightforward to use a different API for GridFTP, and complete this initial iteration of the application.
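
For the curious, the logic of an n-copy replication rule can be sketched in a few lines of Python. This is only an illustration of the idea, not my application and not the RLS or GridFTP API: the dictionary standing in for the replica catalog, the site names, and the `plan_replication` function are all hypothetical.

```python
def plan_replication(catalog, sites, n):
    """Return (file, source, destination) transfers so each file gets at least n replicas.

    catalog maps a logical file name to the set of sites that hold a copy
    (a stand-in for what a replica catalog query would return).
    """
    transfers = []
    for lfn, holders in sorted(catalog.items()):
        if not holders:
            continue  # no existing copy to replicate from
        missing = max(n - len(holders), 0)
        source = sorted(holders)[0]  # arbitrary but deterministic choice of source
        candidates = [s for s in sites if s not in holders]
        for dest in candidates[:missing]:
            transfers.append((lfn, source, dest))
    return transfers


# Very simple test data, in the spirit of what I feed the application.
catalog = {
    "a.dat": {"site1"},
    "b.dat": {"site1", "site2"},
}
sites = ["site1", "site2", "site3"]

plan = plan_replication(catalog, sites, n=2)
print(plan)
```

In this sketch the rule only decides which transfers are needed; actually moving the files would be a separate step, which is the part that is giving me trouble.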

The next step is deciding what policies I might try to implement and how we are going to evaluate them. Also, a few things that have been kind of worrying me are that the rule engine is pretty slow (I am told rule engines, in general, are not known for speed) and that the particular open source rule engine I am using seems to have some pretty serious memory constraints, from what I am hearing on the user mailing list grapevine. The speed factor should not be such a problem since we are imagining the work done by the application to be more of a background or maintenance task (run as a cron job, perhaps), but the memory factor could be a problem. There are possible workarounds, but we'll see. I don't think this is necessarily a typical problem for rule engines, but is, instead, specific to the open source version that I am using. There is a commercial version of this rule engine and for some reason I doubt it has this problem.
week=4
It seems my hunch that the Globus installation process would be unpleasant was right on the mark. I spent pretty much this entire week trying to set up the Globus Toolkit on my computer. It was supposed to have only taken a day, but I spent a lot of time trying to overcome roadblocks (such as trying to configure MySQL with unixODBC - blech) only to find out that I wasn't even supposed to be going down that road in the first place! So I was frustrated from feeling that many entire days were wasted, but actually it is a good learning experience. I am learning quite a bit more about Linux (the machine they gave me at work runs Fedora 8), filesystems, various well-known tools like xinetd, and shell scripting! I'm also improving my command line and vim (yes, not emacs) repertoire, and I've learned to make user mailing lists and discussion forums my best friends, as it turns out, unsurprisingly, that a lot of people run into the same problems I do. So, that is the bright side. But, as my advisor pointed out, I am really here for the experience of doing research, so it'd be nice to move on. All this stuff is just sort of tedious "hocus pocus", as she put it.

I agree that I've spent more than enough time on this and I hope I have to spend considerably less time now on installation/configuration. But, my thought is that I have to get all of this set up first before I start coding my application because I need to be able to make sure my application can talk to the Globus Toolkit services and learn how these services work before I start churning out a bunch of untestable code. Because without knowing how the services operate and being able to verify that my application is using them correctly, I would most likely end up with a bunch of broken code and have to scrap it all anyway, and I'd still have to do all the installation at some point. Also, once I get this out of the way, I think it will take much less time to get an initial iteration of my application up and running, from what I've read. Next week, I really will start trying to write my application and filling in details such as what exactly my application should do, what problems we are trying to address and what problems are we not, what policies we should implement, what assumptions we will make, what our motivations are, and so on.
week=3
This week I am finally starting to look at implementing something using Drools, an open source rules-engine that is part of the JBoss collection of middleware applications. We have also identified two possible conferences for me to submit to, ideally a paper, but if not, a poster. I am pretty excited about that, but looking at these deadlines makes me realize how quickly the little time I have goes by. I've heard from a number of sources that going to big conferences, especially in the area that you might want to work in, is very important for making contacts, so I feel not an insignificant amount of pressure to do something good enough to be accepted to one of these.

The first major task for me to finish, once I figure out how Drools works, is to figure out how to make my application interface with some of the Globus Toolkit services that I'll be using along with the rules-engine. This also means I have to install Globus, which sounds unpleasant...
week=2
This week involved a lot more reading, but the focus of it has shifted from a general overview of grid computing and the work done in the past by Dr. Chervenak and Dr. Deelman (Tina's mentor) to more current research and topics that more specifically pertain to what we will be working on this summer. Dr. Chervenak has been very helpful in pinpointing an area of research that I can focus on that will be interesting but at the same time small enough that I can have some finished products to show at the end of the summer, i.e. a paper. A USC graduate student in Dr. Chervenak's group, Amer, who kindly shares his office with Tina and me, has also been exceedingly helpful and generous in offering to collaborate in some way. There are a lot of details to be ironed out, but I am very excited about the topic we've decided to focus on.

I am really enjoying all of the reading (although it's hard to keep everything straight and manage to absorb it all at once), but I also look forward to actually starting to implement something and running some experiments. There is work to do in precisely defining the problem, determining what sort of assumptions we will make, and deciding how we will measure the goodness of our final product, but I am finding it to be a lot of fun. To be honest, the amount of enjoyment I am getting from doing this work pleasantly surprises, relieves, encourages, and motivates me. Recently, I had been starting to second-guess myself and worry that I wasn't cut out for grad school. I had been afraid that I wouldn't enjoy this research business at all and was starting to wonder if I was really making the right choice about what to try to do in the future, but I'm much more confident now that I'm going in the right direction, at least in terms of what I like to do. But whether other people will think I'm right for it is another matter...Ah well, they say you shouldn't let fear of rejection stop you. I like to think I have a pretty sweet backup plan anyway.

In addition to all of the learning and researching and other things that I have to do for my work this summer, I need to start thinking about what I want to do next year. There are scholarships to apply to and schools to research and websites to update and thesis topics to pick and graduate exams to take and Father's Day gifts to buy and budgets to balance and more that I'm sure I'm forgetting... This upcoming year (starting now) could quickly become overwhelming, so I'm going to have to try to stay as disciplined as I can personally manage. It's a good thing to learn to do but I don't think it will be so easy. We'll see how it goes. There are also a lot of little other things I'd like to accomplish, like learning Perl, figuring out what kind of laptop to get next year, installing Linux on my current stupid laptop, playing around with some virtualization tools, etc., but those are pretty low priority. Which is too bad, because they seem fun.
week=1
This week was busy. I moved into the apartment that I am living in this summer with another student, Tina Hoffa, who is also participating in the DMP at ISI. We met with our mentors and discussed some goals for this summer. They pointed us to a great deal of reading on various aspects of Grid computing and technologies, workflow generation, data management, and more.

My (as of yet) untested strategy for dealing with all of this reading is to collect all of the paper copies they give us into one binder, and as I read, highlight and take notes, and then include the notes I took along with the paper in the binder once I've finished. It seems like a good idea, conceptually, but I think I'd be better off if I transferred all of my notes and things to electronic format, so I've been looking into some ways to manage that. I'd also like to have all of the citation information for each paper together with the notes for easy access in the future when I have to write a paper about all of this. I've been exploring a few different options but I've been so busy and taking care of all that is pretty far down on my to-do list, so it probably won't happen soon.

After getting set up in our new office, we basically spent the whole week reading. We also met the other members and students on their respective research teams, as well as some of the ISI staff. Everyone seems very kind and helpful and I expect to learn a great deal from many of them, as well as partake in fruitful collaboration.

current as of July 2008