Bayle Shanks's website: tips-programming

programming stuff:

--- tutorial notes: Using Perl for text manipulation

terms to remember:

"regular expressions" are a grammar for textual pattern matching
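The regexp engine tries to match the expression starting at each location in the string, from left to right, and stops at the first location where a match succeeds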
To parse data: (1) think of how you would represent how the input looks in terms of a regular expression (2) use parentheses to surround parts of the regexp and put them into $1,$2,etc variables
"Kleene star" = '*' used to eat "0 or more" repetitions
when you have repeated elements, you almost always use + or * rather than manually repeating them in the regexp.
at each attempted match at a string location, can think of + and * "eating" a bunch of characters in the string
To output parsed data:

   print "A string like this with fields in it; field 1: $1, field 2:$2";

man pages: "man perlre" "man perlrequick"
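
A small sketch of the notes above, assuming a made-up input format of "name age city" separated by whitespace (the helper name parse_record and the sample strings are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: split a line like "alice 30 paris" into its fields.
# Returns the captured fields, or an empty list if the line does not match.
sub parse_record {
    my ($line) = @_;
    # (\w+) eats 1 or more word characters, \s+ eats the gap between
    # fields, and (\d+) eats 1 or more digits. Each pair of parentheses
    # fills the next of $1, $2, $3.
    if ($line =~ /^(\w+)\s+(\d+)\s+(\w+)$/) {
        return ($1, $2, $3);
    }
    return ();
}

my ($name, $age, $city) = parse_record("alice 30 paris");
print "name: $name, age: $age, city: $city\n";

# The engine tries each starting location in turn, so the digits are
# found even though they sit in the middle of the string.
if ("abc 123 def" =~ /(\d+)/) { print "found: $1\n"; }   # found: 123

# '*' may eat zero characters, so it already succeeds (with an empty
# match) at location 0 of "xaaa"; '+' has to move on to the first 'a'.
if ("xaaa" =~ /(a*)/) { print "star: '$1'\n"; }   # star: ''
if ("xaaa" =~ /(a+)/) { print "plus: '$1'\n"; }   # plus: 'aaa'
```

The last two lines are the usual surprise with the Kleene star: "0 or more" means an empty match is a perfectly good match.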
(unrelated Perl tip) The commands:

        use Data::Dumper;
        print Dumper(\@list);

will print out an entire list for you. Very useful in debugging.

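    #!/usr/bin/perl
    # Read lines of three space-separated numbers from testD and write
    # the second and third number of each line to testout.
    use strict;
    use warnings;
    use Data::Dumper;

    my $filename = "testD";
    my (@in, @out);

    open(my $in_fh, '<', $filename) or die "Can't open $filename: $!";
    while (my $line = <$in_fh>) {
        chomp($line);
        push(@in, $line);
        # debugging: show everything read so far
        print Dumper(\@in) . "-- next iteration\n";
    }
    close($in_fh);

    foreach my $item (@in) {
        # skip lines that don't match, so $2 and $3 are never stale
        next unless $item =~ /(\d+) (\d+) (\d+)/;
        push(@out, "$2 $3");
    }

    open(my $out_fh, '>', "testout") or die "Can't open testout: $!";
    foreach my $item (@out) {
        print $out_fh "$item\n";
        print "$item\n";
    }
    close($out_fh);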

rough notes from mystuff/notes/libraries.txt; write this up

How does a program or "system" handle reuse of code, and handle one person using another person's code?

Say that you want to do something "common". Say you want to search through a string for the occurrence of some substring. And say the language in which you've written your program doesn't provide a primitive for doing this.

[note: the "issues" are rather biased because they were constructed after I knew what the next solutions would be; the issues of one solution are really just the features of the next one]

Solution 0: write the code yourself

We could have everyone write all the code for everything they want to do. The problem is that everyone would be constantly "reinventing the wheel". More specific problems (or rather, issues that can be improved):

1) Waste of coding time. Someone else has already written code to search through a string, why should you redo it?

2) Quality. If someone else has been using and debugging their string-searching code for 15 years, and you reimplement that feature, you will likely have some bugs and will make some "classic mistakes" that others have already suffered through. (well, okay, maybe not with string-searching, but imagine something more complicated)

3) Code readability. Anyone else who debugs your code will have to take a small amount of time to learn what the function "my_string_search" does and what arguments it takes and in what order.
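
To make Solution 0 concrete, here is roughly what a hand-written "my_string_search" could look like. This is only a sketch: the argument order and the -1 return value are my assumptions, and Perl itself already provides the built-in index function, so in Perl you would never need to write this.

```perl
use strict;
use warnings;

# Hypothetical hand-written search: returns the 0-based position of the
# first occurrence of $needle in $haystack, or -1 if it does not occur.
sub my_string_search {
    my ($haystack, $needle) = @_;
    my $last_start = length($haystack) - length($needle);
    # Try every possible starting position in turn.
    for my $start (0 .. $last_start) {
        return $start
            if substr($haystack, $start, length($needle)) eq $needle;
    }
    return -1;
}

print my_string_search("reinventing the wheel", "wheel"), "\n";   # 16
print my_string_search("reinventing the wheel", "axle"), "\n";    # -1
```

Even in this tiny function a reader has to check which argument is the haystack and which is the needle, which is issue 3 in miniature.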

Solution 1: get the code from a friend

You could ask all your friends, search the web, and post a request on Usenet for code to search through a string. This solves issue #1, above (waste of coding time). However, issues #2 and #3 are only partially solved, and there are two new issues, #4 and #5:

2) Quality: although the quality problem as stated above has been solved, what if someone discovers a bug (and a solution) a year from now in that string-searching code that they just gave you? Perhaps they'll email you, and you'll have to manually modify your own code. Perhaps they'll forget about you, and the bug will remain undetected in your code. Let's make some subissues to take note of this scenario. We'd like mechanisms to:

2a) Let you know when a bug (and a fix) has been discovered in code that you got from someone else.

2b) Automatically paste the fixed code into your own program.

3) In the special case where everyone looking for string search code happens upon the same "Best String Search Code Ever" website and gets their code from there, others will be familiar with the "best_string_search" subroutine calls that they find in your code. But more likely you'll end up with "joe_schmoes_string_search" that someone gave you on Usenet, and others will still have to expend effort understanding your code.

4) (new major issue) Finding good code from others is a hit-or-miss business. Which web sites do you look at? What search terms do you use? What if really good string search code is publicly available and you don't hear about it? If both "joes_string_search" and "suzys_string_search" are available, along with 50 others, how do you decide which one to use without spending hours comparing them and looking for others' opinions, and without just randomly picking one?

5) (new major issue) You find a great string search, but it is written in Fortran and you are programming in C. Shame that you can't use it.

Solution 2: libraries (as source code)

Libraries are collections of functions which can be used by multiple programs. Instead of just posting a Usenet message containing some code for "joe_schmoes_string_search", Joe Schmoe writes a library source code file, which contains the same code, with maybe some extra "library" syntax stuff depending on the programming language. You download the library source file as "joe_schmoe_lib_v1" to your hard drive, and then compile that file with your program.

Now if Joe finds a new bug and fixes it, he puts up a new library file, and you just replace your old one with the updated version. This is slightly easier than cutting and pasting the new code from a Usenet post. Issue 2b is solved; the others remain. And here are two new issues (for these you have to once again imagine that joes_string_search is a huge file):

6) (new major issue) You write two or three different programs, all of which use joes_string_search. You have to recompile the library for each of these programs. The library is really big, so it takes a while to compile. Why should you have to compile it three different times? In fact, since your friend Dan also uses the same library on the same operating system, and he already compiled it, why can't you just get the object code from Dan?

7) (new major issue) In addition, you don't know it, but you have on your system 10 programs by others which use the same code. Seems like a waste of space to have the object code produced by joes_string_search duplicated 13 different times on your disk (and in memory, if you run all 13 of these programs at the same time).

Solution 3: object code of libraries

To solve issue (6), compilers provide a feature that compiles a library to object code, so you can just download the object code from Joe instead of the source. Then, after you compile your program, you "link" it to the library object code to produce the final result.

Now, you never have to compile the library yourself because what you downloaded was a precompiled version.

Solution 4: dynamic linking

This is something implemented by operating systems to solve issue (7). Instead of you keeping the library in your own folders and telling the "linker" where to find it after you compile, you just have some sort of identification for the library (say, a string name like "joes_string_search_library" along with a version number like "1.0"), and the operating system keeps track of where the library actually is.

What happens is this. When you compile and link your program, you don't tell it "link with file /home/me/programming/project3/joes_string_search_library". You tell it "link with dynamic library joes_string_search_library version 1.0". The program is not actually linked when you make it. Instead, the program is linked when you run it. When you run it, the operating system sees that your program is requesting library joes_string_search_library version 1.0, and finds that library.

Now, all 3 of your applications and all 10 of the other ones don't include the actual object code for joes_string_search_library; all they have is a request for "joes_string_search_library version 1.0". Now there only needs to be one copy of the actual library on your machine.

On GNU/Linux, these "shared libraries" are stored in places like /lib and /usr/lib. You can use the command "ldd" to see which libraries are requested by a given program file. On Windows, these libraries are generally called "DLLs", and have the .dll extension.
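
Perl's own module loader gives a rough analogy for "name the library, and let the system find it at run time". This is not dynamic linking in the operating-system sense; it is only a sketch of the same lookup-by-name idea, and the helper name locate_module is made up for illustration. The program asks for a module by name, and Perl searches its library directories and records in %INC which file it actually loaded.

```perl
use strict;
use warnings;

# Hypothetical helper: load a module by name at run time and report
# which file Perl actually found for it (undef if it is not installed).
sub locate_module {
    my ($name) = @_;
    (my $file = "$name.pm") =~ s{::}{/}g;   # List::Util -> List/Util.pm
    return unless eval { require $file; 1 };
    return $INC{$file};
}

my $path = locate_module('List::Util');
print "List::Util was found at $path\n";
```

The program only contains the request "List::Util"; where the file lives is decided on the machine that runs it, which is also why a missing library only shows up at run time.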

This adds a new problem:

8) (new major issue) If you write a program using joes_string_search_library, your users won't be able to run that program unless they also have joes_string_search_library installed on their system. This might be confusing for end-users.

Let's recap before going on.

Issue 1: Waste of coding time; solved. You use other people's code instead of writing everything yourself.

Issue 2: Quality; partially solved. By using other people's code, you avoid remaking their old mistakes.

Issue 2a: Not solved yet. If a new bug is found in code you're using, you might not hear about it.

Issue 2b: Solved. When someone has a new version of code that you're using, you can update a file instead of cutting and pasting into your code.

Issue 3: Readability; partially solved. To the extent that you use libraries that are popular, other people will be able to understand much of your code.

Issue 4: Finding code; not solved yet.

Issue 5: Cross-language reuse of code; not solved yet.

Issue 6: Recompiling common code; solved.

Issue 7: Duplicating common code in disk and memory; solved.

Issue 8: Need to distribute shared libraries to users; not solved yet.

So, we're left with:

Issue 2a: We'd like to know when a new version of common code has been developed.

Issue 3: We'd like our code to be as readable as possible.

Issue 4: We'd like a good way to find common code.

Issue 5: We'd like to reuse code across languages.

Issue 8: We need a way to distribute shared libraries to users.

To continue,

Solution 5: Distribution structures

If Joe (and others) think of his string search code as just something which he used in some program and then posted to Usenet once, future bugfixes and improvements to the code may not be distributed to others using that code. But if he thinks of it as an entity unto itself, tied to an ongoing project to develop and distribute it, things change. If people think of that string search code as a whole project, then they might put up a website for it, or put it on SourceForge, or do something similar which allows others to check for new versions and download them.

This solves issue 2a; now at least there is a way to check if a new version of a library has been developed (go to the project web site). Alternatively, you can subscribe to the email list of each project whose code you are using.

Solution 6: Packaging systems

It is still a pain for users to install programs which use shared libraries. In addition to installing the programs themselves, they must tell their O.S. to install the shared library. The obvious solution is to have some standard installation method that installs both a program and its libraries in one go.

On Windows, installing a program generally means installing both the program itself and any needed libraries, transparently to the user (InstallShield does this, for example). Needed libraries are included with each program.

On GNU/Linux, distributions with packaging systems (like Red Hat with RPM, or Debian with apt-get) were developed. The distributions provide centralized depots where any needed libraries may be found.

For example, on Debian, if I type "apt-get install mutt", Debian installs not only the program "mutt" itself, but it also downloads and installs any libraries required by "mutt".

Now users don't need to know much about shared libraries (in theory).

In addition, packaging systems and dynamic libraries can do another cool thing. Say that a bug is discovered and fixed in a frequently used library. End users don't have to wait until each of their programs is recompiled and made available with the fixed library. They just have to replace the library themselves. Now, this is too complicated for end-users, but packaging systems can do this automatically. In Debian, for example, you can use programs like "aptitude" to download off the internet a list of all libraries and programs for which new versions are available, and then upgrade all of them automatically.

So issue 8, distribution of libraries to users, has been solved (and then some) (although actually many GNU/Linux package management systems are still too complicated for end users).

Solution 7: Cross-language bindings

This is targeting issue 5, reuse of code across languages.

Many languages allow programs written in that language to call and to be called by certain other languages. For instance, Perl has mechanisms to call C functions and to be called from C. It so happens that C has become sort of the "lingua franca" for this kind of interaction; at least in the Unix/GNU Linux world, many languages have ways to call and to be called from C.

The downside is that for most languages, some sort of "glue code" must be written for each library in that language that is to be called from a C program, or vice versa.
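
As a small illustration from the Perl side: Perl's core POSIX module is a set of bindings to functions from the C library, so the glue code has already been written for you. The wrapper hex_to_int below is a hypothetical helper, not part of POSIX; floor and strtol are the real C functions as exposed by the module.

```perl
use strict;
use warnings;
use POSIX qw(floor strtol);

# Hypothetical helper built on C's strtol(): parse a hexadecimal string.
# In list context POSIX::strtol returns the parsed number and the count
# of characters it could not parse.
sub hex_to_int {
    my ($text) = @_;
    my ($value, $unparsed) = strtol($text, 16);
    return $unparsed == 0 ? $value : undef;
}

print floor(3.7), "\n";         # 3
print hex_to_int("ff"), "\n";   # 255
```

From the caller's point of view these look like ordinary Perl functions; the cross-language call is hidden inside the module.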

I'd say that issue 5 has been "solved" in that it is usually theoretically possible to reuse code across languages, but imperfectly.

Solution 8: Component architectures

Component architectures like COM and .NET attempt to provide a language-neutral interface between caller and callee. That is, instead of calling the callee function directly, you tell COM to call it for you, and COM handles any glue.

(refs to write more: http://www.advogato.org/person/mikehearn/diary.html?start=11, "COM and the COMpetition")

Solution 9: Library repositories

This is a step towards solving Issue 4 (finding common code). Repositories have been set up with lists of available libraries in various languages (Perl has CPAN, for example). These repositories have categorization systems to allow you to find a library that deals with the kinds of problems that you are trying to solve.

However, many repositories (such as CPAN) don't make it easy to rank libraries, or to see which ones are most popular or refined. Since there are often many different libraries which solve similar problems, it can take hours to decide which one you want to use.

Solution 10: Peer reviewed repositories

For instance, boost.org, for C++, peer reviews submitted libraries before inclusion. Something I would like to see is:

1) Voting on other repositories, like CPAN.

2) A site like "canonicaltomes.org" for libraries, which would quickly answer the question, "what are the top 1-5 canonical libraries for solving a certain sort of problem".

3) Library repositories focused on the problem being solved, rather than the language of the library. Some problems (like integrating differential equations) are generic enough (or esoteric enough) that it might be just as easy for you to use, say, a C library even if you are programming in Perl. Or, for example, perhaps you want to do some research with neural networks, and are interested in finding the "canonical library(s)" used by today's neural network researchers.

Solution 11: Standardized libraries

For example, the STL in C++, or many libraries in Java. A standardization process eventually designates a single library to fulfill certain functions. Now programmers don't have to spend time choosing which library to use for that function. In addition to making it easier to find libraries, standardization and peer review fulfill another function which is just as important; they make code more readable by creating a standard vocabulary shared by a large number of people. [write more on this later]

I would like to see:

1) more accessible processes for changing canonical libraries/de facto standards into recognized standards. That is, if there is a single library that everyone is using for differential equation operations in C++, there should be a non-painful way for that to be recognized and highlighted (i.e. without someone spending a great deal of time and effort to get it through some committee). See suggestions in soln. 10 for more on this.

2) Library "standards" that are flexible enough to highlight two or three competing libraries rather than just one. If there are two libraries for dealing with matrices in C++, each with its own syntax which is convenient in different ways, I don't see the need to pick one and marginalize the other.

3) More interaction between libraries and a more conscious push towards standardized libraries. For example, if there are 20 regular expression libraries for some language, they should be looking at each other and absorbing each other's functions and strengths. Maybe library A would eventually become a superset of library B, in which case the library B developer might say, "OK, my work here is done, I suggest all of my clients use library A from now on".

misc: functions

Here's a new issue:

9) What if program A was compiled with version 1.0 of joes_string_search_library and program B was compiled with version 2.0? What if the way you call the functions changed between 1.0 and 2.0 (e.g. if some of the functions have different names in 2.0 than they had in 1.0)?
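
One common way to at least detect this situation is for the client to state which version it needs and for the library to state which version it is. A sketch in Perl, where a module plays the role of the shared library (the package JoesStringSearch and the helper have_version are hypothetical; the VERSION method is the standard one every Perl package inherits):

```perl
use strict;
use warnings;

# Hypothetical library, defined inline here so the example is runnable.
package JoesStringSearch;
our $VERSION = '1.0';
sub search { my ($haystack, $needle) = @_; return index($haystack, $needle); }

package main;

# Hypothetical helper: true if $module is at least version $wanted.
# $module->VERSION($wanted) dies when the installed version is too old.
sub have_version {
    my ($module, $wanted) = @_;
    return eval { $module->VERSION($wanted); 1 } ? 1 : 0;
}

print "version 1.0: ", (have_version('JoesStringSearch', '1.0') ? "ok" : "too old"), "\n";   # version 1.0: ok
print "version 2.0: ", (have_version('JoesStringSearch', '2.0') ? "ok" : "too old"), "\n";   # version 2.0: too old
```

This only catches "the library is too old". It does not help when a newer version has renamed or removed functions that the program still calls.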

c++ libraries

http://www.boost.org/

--- http://www.cetus-links.org/

--- a c++ problem i had once:

pass-by-reference vs. pass-by-value:

I had three objects, A, B, and Z. Both A and B had an instance variable which was supposed to point to the same Z. First, A created Z, then A created B, and it passed Z to B.

But I accidentally passed by value instead of passing by reference. Then, when I called some methods on Z, it would mysteriously crash. I finally found out that this was because Z was not in the state that I thought it was. When A passed Z to B, I assumed that it passed by reference, not value, since I knew that all A had in its instance variable anyway was a reference to a separate object Z. But in fact, it had transparently made a copy of Z, and passed a reference to that copy, Z', into B.

So the program was crashing because I set up the object Z with A, and then I tried to use it from B -- in fact, though, B had its own Z' which hadn't been made ready for what I was trying to do with it.
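
The same class of bug can be sketched in Perl (used here only to match the other examples in these notes; the original problem was in C++, and all names below are made up). Passing a hash itself to a subroutine flattens it into a list, which the callee copies into a new hash, so B ends up with its own Z'. Passing a reference shares the one Z.

```perl
use strict;
use warnings;

# B built "by value": the flattened hash is copied into a brand-new %z,
# so B holds a reference to its own private copy, Z'.
sub make_b_by_value {
    my (%z) = @_;
    return { z => \%z };
}

# B built "by reference": B holds the very same Z that A holds.
sub make_b_by_reference {
    my ($z_ref) = @_;
    return { z => $z_ref };
}

my %z = (ready => 0);                       # A creates Z
my $b_copy   = make_b_by_value(%z);         # the accident
my $b_shared = make_b_by_reference(\%z);    # what was intended

$z{ready} = 1;                              # A sets Z up afterwards

print "copy sees ready=$b_copy->{z}{ready}\n";       # copy sees ready=0
print "shared sees ready=$b_shared->{z}{ready}\n";   # shared sees ready=1
```

The copy never sees the later setup, which matches the symptom above: B's Z' was not in the state I thought Z was in.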