maltanar's scribbles: gsoc


Monday, August 16, 2010

Final Weekly Report, #13

So here comes the final weekly report... I just want to say once again what a wonderful experience it has been to work with the BeagleBoard for GSoC! As I stated in last week's report, I believe I've managed to complete most of what I (and my mentors) had in mind for the end of the GSoC period. There's (as always!) more to do, of course, and I intend to continue developing C6Run in particular and contributing to the BeagleBoard in general (though with the start of the academic year, I probably won't be able to devote as much time). Right now, though, I could really use a little break, and will be taking one :)

Weekly Report #13
Submitted on 2010-08-16
Covers 2010-08-09 to 2010-08-16

Status and Accomplishments
• documentation moved to its own eLinux wiki page (at http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation), divided into two sections (Usage, Architecture) and expanded
• build system modifications: the stub generator utility now copies the generated stub files automatically, and there is a dedicated folder for user sources
• code cleanup: move GPP and DSP common definitions to a single header file, work on GPP-side pointer alignment problems, partition GPP-side code into modules

Wednesday, August 11, 2010

Documentation's moved

Just writing to let my readers know that the C6Run documentation has moved to the eLinux wiki and will be residing there from now on:

http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation

I spent yesterday expanding and improving it, and there's more to come.

Monday, August 9, 2010

Weekly Report #12

Submitted on 2010-08-09
Covers 2010-08-02 to 2010-08-09

Reflections
As we're past the soft deadline (aka "the suggested pencils down date" in GSoC speak), I wanted to write a bit about how DSP-RPC-POSIX has progressed so far. I'd say the project can now do most of the things that I (and my mentors) had imagined it would be doing by now. C6Run's own goals of easy DSP app prototyping with console/file system access are pretty much accomplished, as is the goal of being easy to get started with (although there are debatable issues surrounding this particular goal, see below). DSP-RPC-POSIX's twofold goals, which were basically being able to call GPP-side functions from the DSP and providing a prebuilt set of POSIX (actually, "standard C library" would be a more appropriate term) remote accessor functions, are also met. Right now, a sample DSP development cycle for a beginner developer using C6Run could look like this:

• obtaining the sources from SVN
• setting up (which includes a script that downloads and sets up the C6Run dependencies) and building
• using C6RunApp and DSP-RPC-POSIX to construct, test and debug the wanted DSP-side algorithm
• using C6RunLib to wrap up the finalized DSP-side algorithm with an ARM library and use this library to build DSP-aided GPP apps

The specific capabilities and properties of DSP-RPC-POSIX include:

• ability to call most C standard library functions on the GPP side without any extra effort (stubs are already generated)
• ability to call any GPP-side function, with support for all basic parameter types and buffers and obtain the return value, provided that stubs for the target function exist
• a stub generation tool to easily generate stubs for GPP-side functions (given a number of C source files, it will generate the stubs necessary for RPC access to all the functions contained)
• function signature system allows addition of new data types for parameters/return types with special treatment
• architecturally, the RPC layer is mostly separate from C6Run's other functionality, so it could be taken apart and used in other projects

There are, of course, certain shortcomings:

• C6Run still does not build as part of OE. The build system has been modified to skip rebuilding existing dependencies, but there are still problems building for the Beagle. Once these are fixed, it won't take long to have an OE recipe in place; to my knowledge, there are folks at TI who are already working on this.
• pointers/buffers issues: double/triple/etc. pointers, pointers inside structs, pointers hidden inside buffers (ie, any pointer that is not explicitly declared as a parameter) can't be handled automatically and need to be manually address-translated by the user.
• structs can't be passed as parameters, since each struct type would need its own signature character; they need to be passed as pointers to structs instead
• pointers/buffers to be passed via RPC need to be either less than a fixed number of bytes, or be allocated using rpc_malloc instead of from the DSP heap or stack

Status and Accomplishments
• DSP-side caches are re-enabled for all platforms; stubs with pointer/buffer parameters call the necessary writeback/invalidate functions to keep the caches coherent
• ARM-side cache coherence code and cache coherency testing example in place
• stubs for string.h functions and POSIX low level I/O functions

Plans and Tasks
• what Google suggests after the soft pencils down date - code scrubbing (mainly fixing possible memory alignment issues, moving commonly used things to GPP/DSP shared header files etc.), improving documentation and writing more tests/examples

Risks, issues, blockers
• for some reason (probably due to the absence of a BCACHE_wbAll function on DSP-side?) enabling ARM caches still results in cache coherency problems.
• the C6Run trunk still produces problematic executables for the BeagleBoard, so a merge with the dsp-rpc-posix branch doesn't make sense at this point, which means OE integration won't make it to the GSoC deadline

Wednesday, August 4, 2010

Cache Coherency

The presence of ARM- and DSP-side caches poses a cache coherency problem when both processors access a shared memory area. Let's consider the following scenario:

1. A CMEM buffer is allocated for shared usage by the GPP side and its physical pointer is passed to the DSP.
2. The DSP wants to read and then write some data into this buffer. Let's say that there are free entry slots in the DSP L2 cache - so the data actually gets written to the DSP cache, instead of making it to the DDR shared region. The DSP then signals the GPP that it's done with the buffer for the time being.
3. The GPP attempts to read the buffer, but what it reads is just the garbage values present in the buffer after initialization, since the DSP-written data is in the DSP cache. This buffer now also gets cached on the ARM side, so when the GPP writes some new data into it, the data stays in the ARM cache and doesn't make it to main memory either.
4. If the DSP tries to read the buffer now, it won't get what the ARM has most recently written into it, since it'll be reading from its own cache, and vice versa for the ARM side.

We can see that the "same" buffer actually exists in three different locations (main memory, DSP cache, ARM cache), all of which can contain totally different data - in this case it is said that they are not coherent, and that we have a cache coherency problem.

In many cached multiprocessor systems, hardware cache coherency protocols prevent these situations from occurring. That's not the case here; the TMS320C64x+ DSP Cache User's Guide states:

In the following cases, it is your responsibility to maintain cache coherence:
• DMA or other external entity writes data or code to external memory that is then read by the CPU
• CPU writes data to external memory that is then read by DMA or another external entity

Thus we have to maintain cache coherence manually whenever the DSP and the GPP both access a CMEM region. Studying the scenario above, we can observe two underlying problems:

1. If the memory block to be read already exists in the local cache, there's a risk that the local cache is outdated: we need to discard the local cache entries and fill them up with information from the main memory. This process is called cache invalidation.
2. When the memory block is to be written into, there's a risk that the info remains in the local cache and doesn't make it to the main memory: we have to make sure that the new info gets written to the main memory as well. This process is called cache writeback.

Therefore, from an RPC perspective, for a call that involves transferring buffers, the steps we have to take are as follows:

1. Before passing the marshalled info via DSP/Link, the DSP must do a cache writeback
2. Before passing the params to the GPP side stub, the GPP must do a cache invalidate
3. After the GPP side stub is finished, the GPP must do a cache writeback
4. The DSP side stub must do a cache invalidate before terminating

This assumes that both processor caches are active: if the DSP cache is disabled, steps 1 and 4 are not necessary, and likewise steps 2 and 3 for a disabled GPP cache.

Monday, August 2, 2010

Weekly Report #11

Submitted on 2010-08-02
Covers 2010-07-26 to 2010-08-02

Status and Accomplishments
• an elementary version of the mechanism for passing DSP-allocated (stack or heap) buffers to RPC functions is now in place. Essentially, when the GPP side detects that a non-CMEM-allocated buffer was passed as a parameter, it uses PROC_read to read the data from DSP memory and copies it into a regular GPP buffer which the function can access.
• bugfixes for RPC functions returning structures - but pointer parameters inside structs still remain an issue and have to be translated manually
• dsp-rpc-posix branch re-synced with c6run trunk, but to no avail (see blockers section)

Plans and Tasks
• the ARM and DSP caches are both disabled for dsp-rpc-posix to deal with cache coherency, but this impacts the memory access performance quite negatively - enable them again and use writeback/invalidate functions to keep caches in sync
• some POSIX functions are still missing from the RPC library since their corresponding header files (with type definitions and all) don't exist for the DSP side. this includes useful things like ioctls - find a way to expose these through RPC

Risks, issues, blockers
• despite all the time I spent on it, I couldn't uncover the cause of the "bus error" that occurs when I compile even the simplest things (such as hello world) with the latest C6Run trunk (so it's not just my synced branch that's troublesome). In some cases it's just "bus error" and the program stopping; in others PROC_setup fails. The code responsible for these parts hasn't really changed, so I suspect a build/configuration issue. Due to this, and the fact that my primary computer's hard drive crashed (I'm stuck with a silly little netbook!), I haven't been able to try building C6Run inside OE. I'm not sure whether this only happens for me or the c6run trunk is broken (there are indeed some errors that prevent it from building, but they are small and easy to fix).

Monday, July 26, 2010

Weekly Report #10

Submitted on 2010-07-26
Covers 2010-07-19 to 2010-07-26

Status and Accomplishments
• the stub generation script (c6runapp-rpcgen) is now in place; provided with a number of C source files it can generate the corresponding GPP and DSP stubs for the functions defined in the given files, thus exposing them via RPC
• the documentation is expanded to include architectural details and will be maintained in the project wiki pages here
• more RPC examples to cover newly added things like stdio.h variadics and functions returning structures
• synchronized branch with the latest C6Run trunk, whose modified build system can create the C6Run libs without having to re-build the dependencies every time - but there are problems

Plans and Tasks
• investigate why the trunk-synced branch errors out on the produced executables ("Bus error", including on the non-RPC examples) and fix this
• experiment with building a bitbake recipe for C6Run to get it built inside OE
• for predefined RPC stubs whose pointer parameters are only "in" (ie, the RPC target won't modify the contents of the memory pointed) implement a mechanism that will allow these parameters to be alloc'd from the DSP stack or heap - having to allocate every little string via rpc_malloc is annoying

Risks, issues, blockers
• it's not clear why the bus errors occur with the latest trunk-synced version but hopefully it'll be something that I overlooked while merging the changes from the trunk, or possibly a problem with the version of the trunk I used

Monday, July 19, 2010

Forwarding invocation of variadic function in C

I had been brooding over how to do the RPC calls for variadic functions for some time now. Although marshalling an arbitrary variadic call isn't really possible due to the lack of a general method for obtaining the argument count and sizes (see my weekly report #9, issues section), for commonly used stdio.h variadics such as printf and scanf the arguments are well-defined by the format string, so it should be possible to marshal these manually.

I'll be writing a separate blog post about how I went about doing that, but for now I want to talk about a sub-problem: given a number of arguments, how do you forward them to a variadic function? Restated, how do you "dynamically" invoke a variadic function?

Looking this up on the net, I found these two discussions to be the most relevant:

http://stackoverflow.com/questions/150543/forward-an-invocation-of-a-variadic-function-in-c

http://stackoverflow.com/questions/1721655/passing-parameters-dynamically-to-variadic-functions

The second link mentions a library called FFCALL which can be used to pass parameters to variadics dynamically, and this probably is the ideal way of doing things.

I may have found another method for this; so far as I've seen, it works on x86 and ARM. It's based on the assumption that the last mandatory parameter and all the variadic parameters reside contiguously in memory, as well as lots of terrible coding practices, but it should be illustrative enough.

What I'm doing is basically copying a fixed number of bytes from the memory region (=stack) where the variadic parameters are located into a buffer, then passing this buffer to another variadic function which is called with a fixed number of arguments (I picked doubles since they are larger and will allocate more space on the called function's stack). The function then calls memcpy to overwrite its variadic args section with the passed buffer, and afterwards can call the stdarg macros to obtain the variadic arguments, or pass them to something like vprintf.

Here's the code:

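#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>

/* WARNING: this relies on the variadic arguments sitting on the stack,
   contiguously after the last named parameter, as observed on x86 and ARM.
   That is undefined behavior in C and is unlikely to work where arguments
   are passed in registers (e.g. x86-64). */

#define FWD_BYTES (10 * sizeof(double))

void accepter(char *fmt, char *ptr, ...);
void forwarder(char *fmt, ...);

void forwarder(char *fmt, ...)
{
   double d = 9.1;

   /* copy the raw bytes that follow fmt (the variadic arguments) */
   char *buf = malloc(FWD_BYTES);
   if (buf == NULL)
      return;
   memcpy(buf, (char *)&fmt + sizeof(char *), FWD_BYTES);

   /* debug: dump the copied bytes to a file for inspection */
   FILE *o = fopen("tmp", "wb");
   if (o != NULL) {
      fwrite(buf, sizeof(double), 10, o);
      fclose(o);
   }

   /* call with 10 dummy double arguments to open up FWD_BYTES of stack space */
   accepter(fmt, buf, d, d, d, d, d, d, d, d, d, d);

   free(buf);
}

void accepter(char *fmt, char *ptr, ...)
{
   /* overwrite the dummy doubles with the forwarded argument bytes */
   memcpy((char *)&ptr + sizeof(char *), ptr, FWD_BYTES);

   va_list ap;
   va_start(ap, ptr);
   vprintf(fmt, ap);
   va_end(ap);
}

int main(void)
{
   printf("Testing...\n");

   double d = 65.98;
   forwarder("%d %d %d ermm %s and more params! %x %f %x %x \n",
             1, 2, 3, "hello world", 199, d, 0xdeadbeef, 0xbeefdead);

   return 0;
}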

Weekly Report #9

Submitted on 2010-07-19
Covers 2010-07-12 to 2010-07-19

Status and Accomplishments
• support for returning structures by the addition of a special return type signature ('#')
• address translations are now exposed through RPC, so the user is able to perform translations manually
• build system modifications - DSP and GPP-side stubs are now collected from inside two directories instead of two single files
• dsp-rpc-posix branch updated to contain the latest changes in the C6Run trunk
• manual marshalling for stdio.h variadics (printf, scanf, fprintf, sprintf...)

Plans and Tasks
• implement scripted stub generation (ie, the user will provide function declarations and the corresponding GPP and DSP-side stubs will be generated by a script)
• complete architectural documentation
• expand library of RPC examples to illustrate use-cases with different function signatures

Risks, issues, blockers
• since there is no general way of extracting the number and size of parameters in variadic functions (eg, one function may only specify the number of args as the first fixed parameter and accept only integer args, while another may take a zero-argument as a terminator of a number of float args, and yet another may use printf-style encoding), we can't implement a generic marshaller for variadics. for this reason, support for user-defined variadics is left out for now, although users writing their own variadic functions isn't very common and this is not expected to become an issue.

Monday, July 12, 2010

Weekly Report #8

Submitted on 2010-07-12
Covers 2010-07-05 to 2010-07-12

Status and Accomplishments
• lots of work went into solving the long-standing buffer/pointer parameters and return types issue, and we finally have a reasonably well-working system in place. All address translation / memory space mapping is now done automatically, provided that any pointers to be passed point to memory allocated using the RPC allocator (CMEM based).
• three types of pointer return types are identified and supported: no-translation pointers (such as FILE*, which isn't meant to be dereferenced at all), direct-translation pointers (such as the return value of strstr, which points somewhere within the passed parameter) and manual-copy pointers (such as the return value of ctime, where the function obtains the memory by some other method and returns that pointer; size information needs to be specified in the GPP stub)
• most C I/O stubs are now complete (the ones remaining are variadics and infeasible things like threads)

Plans and Tasks
• find a way for marshalling variadic function calls (will probably involve manual work) and complete the remaining stubs
• test completed stubs with existing code
• offer more flexibility for buffer/pointer parameters by providing the ability to do address translations manually and maybe detecting non-shared buffers in the DSP-side stubs then syncing them automatically with some shared ones
• support multiple stub input files instead of placing them only in rpc_stubs_gpp.c and rpc_stubs_dsp.c

Risks, issues, blockers
• variadic functions such as printf and scanf are problematic for marshalling - since there is no well-defined way of knowing how many parameters of which size there are, it's likely that the user will have to provide the parameter packing for these manually
• even when the stub generator tool is in place, it's likely that the user will have to provide some manual guidance for certain cases in stub generation such as identifying buffer/pointer parameters which don't need address translation, return types which need manual copying into a shared buffer and variadic functions

Monday, July 5, 2010

So what happened to the variadic marshaller?

As you may recall, I had a variadic marshaller that I had been using for the RPC layer for some time now, which recently had to be dropped since it was causing trouble with passing float parameters. I want to talk about it here a little bit, since it's not really a problem specific to the DSP or to marshallers, but rather about how certain arguments are passed to variadic functions.

Let's start by talking about what a variadic function is, since you may or may not have heard the term. A variadic function is a function that can take a varying number of arguments. There usually are a few "required" arguments but the sky (or rather, the bottom of the stack :)) is the limit. Sounds familiar? Yes indeed, probably the two most famous functions in the C library are variadic: printf and scanf.

Variadics in C are easily identifiable by their declarations, which look something like this:

int summation_function(int count, ...)

notice the ellipsis ( ... ) - this is the notation used to state that there will be an arbitrary number of arguments here.

And for quick reference - accessing the variadic arguments is done via stdarg.h macros, for example:

int sum = 0;
va_list arg_list;
va_start(arg_list, count);
for(int i=0; i < count; i++)
{
   sum += va_arg(arg_list, int);
}
va_end(arg_list);

whose detailed usage descriptions you can find easily by Googling.

Moving back to the problem I had with the variadic marshaller: I was passing regular 4-byte floats as arguments to be marshalled, but the corresponding 4-byte region in the marshalled buffer always turned out to be corrupted.

The issue was eventually revealed to be about the default argument promotions that are applied to variadics, as described in:

http://www.gnu.org/s/libc/manual/html_node/Calling-Variadics.html

Therefore, any short int or char arguments are automatically promoted to int, and all floats are converted to double, giving them an 8-byte representation whose first 4 bytes don't really mean all that much on their own :)

My initial idea was to simply cast the obtained double parameter back into a float before marshalling it into the buffer, but unfortunately this leads to loss of precision during double->float conversion. Therefore, I've decided to switch to using macros and doing the parameter marshalling directly inside the stubs. Comparing what the DSP-side stubs look like in the old vs. new methods of marshalling:

void rpc_mixedprint(int a, char b, float c, double d, short e, int f)
{
   rpc_marshal("rpc_mixedprint", "vicfdsi", a, b, c, d, e, f);
   rpc_perform();
}

versus

void rpc_mixedprint(int a, char b, float c, double d, short e, int f)
{
   RPC_INIT("rpc_mixedprint", "vicfdsi");
   RPC_PACK(int, &a);
   RPC_PACK(char, &b);
   RPC_PACK(float, &c);
   RPC_PACK(double, &d);
   RPC_PACK(short, &e);
   RPC_PACK(int, &f);
   RPC_PERFORM();
}

I'm aware it doesn't look quite as elegant as the variadic marshaller did... but in terms of ease of stub generation, it's not all that different, and there's no loss of data precision involved. And the GPP-side stub uses macros to extract the parameters as well, once the unmarshaller unpacks the buffer into the void** param_buffer:

void rpc_mixedprint(void **param_buffer, void *result_buffer)
{
   mixedprint(RPC_CAST_PARAM(param_buffer[0], int),
   RPC_CAST_PARAM(param_buffer[1], char),
   RPC_CAST_PARAM(param_buffer[2], float),
   RPC_CAST_PARAM(param_buffer[3], double),
   RPC_CAST_PARAM(param_buffer[4], short),
   RPC_CAST_PARAM(param_buffer[5], int)
   );
}

GSoC Weekly Report #7

Weekly Report #7
Submitted on 2010-07-05
Covers 2010-06-28 to 2010-07-05
Corresponding Draft Schedule Item:
Create the RPC framework that can call functions on the GPP side from the DSP and return values back to the DSP. Implement and unit-test the POSIX function wrappers according to the planned order.

Status and Accomplishments

• the RPC framework was tested with many different kinds and combinations of non-buffer parameters; a few bugs were unearthed, and the issue with floats passed as variadic parameters led to the deprecation of the variadic marshaller. DSP-side stubs now use macros to do the marshalling: not as elegant, but it works far better.
• the GPP-side RPC stubs dynamic link library is now embedded directly inside the resulting executable for cleaner deployment (ie, the user doesn't have to copy the library manually to the Beagle)
• rpc_malloc and rpc_free handlers, using CMEM for alloc/free and address translation, were added to the GPP server, so basic buffer/pointer parameter support will soon be in place
• implemented a simple version of the ARM function caller, but it doesn't work with double arguments or 4+ params of any kind. The macro-based parameter passing method works fine for now, though, and I intend to keep it for a while longer.
• all POSIX/C lib stubs for functions that don't take any buffer parameters are now in place (there aren't that many, though :))
• first usable version of dsp-rpc-posix committed to the repository

Plans and Tasks

• more testing and support for buffer parameters - there are still open questions here, see issues section
• write stubs for the POSIX functions with buffer parameters/return types
• review the build system (how the user provides his/her own sources for use in RPC) and make improvements where necessary, see issues section

Risks, issues, blockers

• the user needs to be able to provide source code and declarations for functions he/she intends to use with RPC - how should this ideally be done? Keep them in a predetermined directory and have the user add them there (easy but not very flexible)? Pass them to the tool as command line parameters prefixed with something special?
• the GPP and DSP-side stubs for custom functions need to be provided manually for now. Although it's relatively straightforward to do by hand, it's very well suited to automation, and I intend to have a stub generator script for this. I'd prefer not to have to write a C parser, though; is there anything available I can use for this?
• the situation with buffer parameters is as follows at the moment: any buffer parameters the user wants to pass via RPC *must* be allocated via the rpc_malloc call. This call is mapped to GPP-side CMEM allocation, and the GPP server performs virtual-to-physical address translation before passing the buffer address to the DSP, so the DSP can work directly on this buffer. But there isn't any direct physical->virtual address translation available on the GPP side - what's the best way to deal with this? Currently the C6Run allocator saves 16 bytes of extra info, including the virtual base address, along with the allocated buffer, and this is how we get rpc_free to work. But there'll be problems if the user doesn't pass the allocated buffer itself, but just a part of it - how will we find the virtual address then? And even if we can find the virtual address for any given physical address, we can only do address translation if we know it's required. What if the user puts a physical address somewhere inside a struct, or even inside another buffer (say, a void** parameter)?
• one solution to this could be adding the allocated virtual addresses to the DSP MMU TLB and having the DSP work directly with virtual addresses, but there are only 31 slots available in the table :( EDIT: this is not a solution at all, the DSP MMU doesn't do any address translation (virtual = physical)! It's just for protection (preventing the DSP from accessing places it's not supposed to).
• another solution could be making this the user's problem: providing virtual addresses from rpc_malloc and making the user do virtual->physical translation on their own (via another RPC function, of course :)) if it's needed.

Saturday, July 3, 2010

DSP-RPC-POSIX initial commit is in place!

I've made the initial commit for dsp->gpp RPC calls (the development had been on my personal SVN repo so far, since I didn't feel it was ready to see daylight :P)

It's nothing spectacular yet (e.g. no buffer/pointer parameters or return values are allowed, so only ctype and math C library calls work), and you probably won't think very highly of the coding style (or the way I do things in the makefiles...) either, but it still works :)

The SVN URL is:

http://gforge.ti.com/svn/dspeasy/branches/dsp-rpc-posix

There's a readme file under the top-level rpc/ directory which should be sufficient to get started.

possible makefile/build issue: the version of LPM I was using (local_power_manager_linux_1_24_02_09) has lpm.av5T instead of lpm_linux.av5T (as was stated in the original C6Run makefiles) so I've changed that. just find/replace it the other way around in build/gpp_libs/modules/Makefile if it complains about that.

Some notes:

• I had to put away my variadic marshaller since it was causing issues with floats (why? blog post coming soon!) - all marshalling is done inside the stubs with macros. It takes more lines, but it's probably a better idea in the longer run.
• The main points of interest in terms of code would be: the files under the rpc/ directory (DSP and GPP side stubs, some additional DSP side functions), rpc_server.c and .h under build/gpp_libs (GPP unmarshalling, symbol location and execution), cio_ipc.c (GPP RPC server, inside the C6Run C I/O server) and c6run.c under build/gpp_libs (extracting the embedded GPP stubs library and setting up RPC buffers).
• I'm using the C6Run C I/O transport system to pass RPC buffers back and forth, for now.
• There are some additional parts in c6run-cc to handle the --rpc switch (adding the DSP stub file to the sources list, compiling the GPP stubs into a dynamic link library and embedding it inside the final executable)
• Custom stubs have to be hand-coded; there's no stub generator tool yet

Support for buffer parameters is coming soon! (Actually, if you write a GPP-side stub that allocates from CMEM or POOL, use this RPC call to allocate memory for the DSP, and pass data in this buffer, everything should work.)

Comments, criticism and suggestions are very, very welcome!

Monday, June 28, 2010

GSoC Weekly Report #6

Weekly Report #6
Submitted on 2010-06-28
Covers 2010-06-21 to 2010-06-28
Corresponding Draft Schedule Item:
Create the RPC framework that can call functions on the GPP side from the DSP and return values back to the DSP. Implement and unit-test the POSIX function wrappers according to the planned order.

Status and Accomplishments
• DSP->GPP RPC code is now integrated into the C6Run sources and works fine alongside the regular C6RunApp C I/O calls. The RPC layer is compiled/linked together with the user-provided sources for now, as putting it into the C6Run library caused some strange problems (see issues section).
• the GPP-side RPC stubs are now compiled into a dynamically linked library (.so), and the GPP-side server uses dlfcn functions (dlopen, dlsym, ...) to locate the function with the given name and invoke it. The stubs still pull the parameters from the buffer on their own; assembly code for automating this is still in the works.

Plans and Tasks
• finalize the GPP-side assembly function caller so that GPP-side stubs don't need to be provided at all
• experiment with different combinations of parameters, including doubles/long integers and shared-mem allocated buffers to make sure everything is working correctly with the marshaller, DSP/Link transfers, unmarshaller and GPP-side server
• finalize the DSP-side stubs for the POSIX functions which don't take/return any buffers as parameters

Risks, issues, blockers
One strange thing that occurred: if I compile the RPC marshalling function together with the user sources given to c6runapp-cc (thus having the RPC layer in the user code, since readmsg/writemsg are accessible from there), everything works fine, but when I try to include the functions in the DSP-side C6Run library, something goes wrong. The function is a variadic function whose definition looks like this:

void rpc_marshal(char *function_name, char *function_signature, ...)

When I compile the DSP side libs with this function inside, the parameters after (and including) the second parameter are not passed correctly. I've observed that even the address of the formal parameter goes awry. To illustrate, let's say I put a printf call inside this function (thanks to C6Run CIO :)):

printf("Stack addresses: %x %x values: %s %s", &function_name, &function_signature, function_name, function_signature);

If I compile and use this function alongside the user-provided sources to c6runapp-cc, everything works correctly, and I get output that looks like this (I've made up the stack addresses, but they were similar to these):

Stack addresses: 0x80000800 0x80000804 values: rpc_function iii@

but if used from inside the library:

Stack addresses: 0x80000800 0x80000854 values: rpc_function

Not sure why this is happening: linker settings? A problem with passing variadic parameters? (If I remove the ... at the end, the two parameters get passed correctly, but then of course the others don't.) Since it works fine when provided alongside the user sources, I decided to stick with that for now.

Monday, June 21, 2010

GSoC Weekly Report #5

Weekly Report #5
Submitted on 2010-06-21
Covers 2010-06-14 to 2010-06-21
Corresponding Draft Schedule Item:
Create the RPC framework that can call functions on the GPP side from the DSP and return values back to the DSP. Implement and unit-test the POSIX function wrappers according to the planned order.

Status and Accomplishments

we now have a marshaller which takes a function name, signature and a variable number of arguments and packs them neatly into a buffer to be transferred via DSP/Link (as well as the corresponding unmarshaller :))
all DSP-side stubs are reduced to three lines: rpc_marshal, rpc_invoke and return result, quite suitable for automation
looked into shared memory issues some more, without much success; for now we'll require the user to pass only buffer parameters allocated from shared areas, and develop a better method later

Plans and Tasks

unfortunately I managed to break C6RunApp's existing RTS I/O while trying to integrate the RPC server. I'll work on fixing this, as it'll be nice to have both working; see the issues section regarding a possible shift of focus to general RPC instead of POSIX
the static jump table design used for resolving RPC calls on the GPP right now is awful; write an assembly function launcher similar to the one in C6RunLib

Risks, issues, blockers

shared memory issues continue, but for now we can overlook them since we can force the user to pass only buffers allocated from shared areas (POOL or CMEM). PROC address translation is there, but we have to avoid segfaults / protection issues; this'll need a bit more time
as I worked more and more with C6RunApp and the existing RTS C library implementation, I've realized that the available functionality is quite sufficient for prototyping purposes. So instead of replacing it with RPC POSIX calls, we can have both methods and leave the choice to the user. It may also be appropriate to shift the focus of the project onto RPC issues, providing a library of examples which can be built up into useful code, or focusing on other ease-of-development issues.

Friday, June 18, 2010

A brief overview of RPC over multiple cores

RPC (which stands for Remote Procedure Call) is defined as follows by Wikipedia:

Remote procedure call (RPC) is an Inter-process communication technology that allows a computer program to cause a subroutine or procedure to execute in another address space (commonly on another computer on a shared network) without the programmer explicitly coding the details for this remote interaction (...)

Within our context of heterogeneous OMAP3 multi-cores, it doesn't mean exactly the same thing as it does in general, but there are still enough similarities. For example, the DSP and GPP use the same memory (i.e., the DDR RAM on the Beagle) as main memory, but this doesn't make their address spaces the same: to start with, the DSP works with physical addresses whereas the GPP works with logical ones. And the binaries are obviously not compatible, since we have two different processors and architectures (hence the "heterogeneous" in "heterogeneous multicore processing"). If we want the GPP and DSP to cooperate towards the same goal (especially in situations like video decoding, where we'd really appreciate the DSP's help), it would be useful to get them to talk to each other without much effort. The DSP should be able to easily obtain the data it needs to work with (which usually resides in the GPP file system or program memory), process it and hand it back to the GPP.

While all of this is quite feasible as of the moment, doing something like this is far from trivial (set up OE, build DSP/Link, examine the examples, learn the DSP/Link API, set up the GPP side structures for bringing up the DSP, try to get the DSP things working without being able to see what you're doing unless you have CCS or are using C6RunApp ;) and so on...). And DSP side code is C code. TI's CGT6000 is a C compiler. You could essentially write the same things and compile/run on either side. Why all this hassle? Why not be able to pass parameters and call functions from one side to the other? Why not live in harmony and peace? Then RPC is the answer! (maybe except that last one).

Scrolling down the Wikipedia article, we see the steps involved in making RPC work, and it's rather easy to map these onto DSP->GPP RPC. Let's say we have the function int gpp_side_function(int param1, float param2, char *param3) somewhere on the GPP side, which we want to call from the DSP.

The client (DSP in our case) calls the Client stub, which has the same definition. The call is a local procedure call, with parameters pushed on to the stack in the normal way.
The client stub packs the parameters into a message (according to some predefined structure, e.g. 2 bytes function name length, n bytes function name, then 2 bytes parameter count, then each of the parameters with their lengths, etc.) and makes a system call to send the message. In our case, sending the message means putting a message on the DSP->GPP MSGQ. Packing the parameters is called marshaling.
The server (the GPP side in our case) receives the message, unpacks (unmarshals) it and locates the corresponding local function call stub.
Finally, the server stub calls the desired procedure. The reply follows the same steps in the other direction.

Similarly, one could do GPP->DSP RPC calls; a component of the C6Run project called C6RunLib is meant for this purpose.

There are certain implementation details that need special attention in our case:

Passing Pointers as RPC Parameters

Since the addressing mechanisms of the GPP and DSP are different, we need to make sure that any pointer parameters / buffers which are passed are accessible from both sides. When doing GPP->DSP calls, one can use the CMEM module to obtain a shared region of memory and translate the addresses back and forth to ensure mutual accessibility, but there is no CMEM interface on the DSP side, so one must resort to workarounds. If the size of each buffer parameter is always known, one can pass the contents of the buffers themselves back and forth via DSP/Link. Another approach could be using the DSP/Link POOL module, which allocates shared buffers and provides address translation, though this will not be suitable for large amounts of memory since there is a limited number of constant-sized POOL buffers. Yet another could be using a special protocol to access the CMEM interface on the GPP side and doing all the allocation from there, but both in this case and in the POOL buffer case we have to ensure that the passed pointer parameters are allocated with our special method and not from the DSP stack region, otherwise we'll be in trouble.

Openness to Expansion

Having an RPC framework that can only call 5 (or 50, or 100) predefined functions on the other side is not very useful; one wants to be able to define and call one's own functions. To be able to do this, the framework must not have too many hardcoded details, such as when to do the address translations mentioned above. To address this, I'm using a function signature system in my implementation to identify parameter types and where special attention is needed. For example, for our above function

int gpp_side_function(int param1, float param2, char *param3)

the corresponding signature would be

iif@

where the first symbol identifies the return type as an integer, and the other three identify parameter types. The final @ indicates that some manner of address translation (or passing the buffer back/forth, if that's the preferred approach) is needed: the marshaler may take care of the translation before passing the message, or the server stub may do the translation itself, depending on how the protocol is defined.

Also, I've designed the marshaler itself to be as user-friendly as possible - let's say you want to provide the DSP side function stub for the above function. All you have to do is pass the function name, function signature (which can be extracted via a simple lookup table) and all the parameters to the marshaller, prompt the data transfer, and return the result with the desired data type. So the stub for the above function would look like this:

int gpp_side_function(int param1, float param2, char *param3)
{
    rpc_marshal("gpp_side_function", "iif@", param1, param2, param3);
    rpc_makecall();
    return RPC_GETRESULT(int);
}

The process is quite suitable for automation, and I'm planning to have a script that auto-generates the stubs given the definitions.

Finding and invoking the corresponding GPP stub

It's easier said than done - so you have a string giving you the GPP-side function name and a bunch of parameters. How do you locate the corresponding function, and how do you actually make the function call? So far, I've been thinking of these approaches:

1. Use a static jump table. An intuitive, albeit not-so-elegant solution. Associate each function name with a number (perhaps via hashing?), and use a function pointer table to jump to the stub. The stub has to take care of demarshaling and passing the parameters. It's a static approach, so one has to add the table entry as well as the GPP-side stubs.
2. Use an assembly function caller. Once we are able to resolve the function address (perhaps statically as in 1, perhaps via dynamic loading with libdl functions), we can have an assembly routine which pushes the parameters onto the stack and calls the function living at the located address. This is what C6RunLib uses, and it will be the eventually preferred method.

Monday, June 14, 2010

GSoC Weekly Report #4

Weekly Report #4

Submitted on 2010-06-14

Covers 2010-06-07 to 2010-06-14

Corresponding Draft Schedule Item:

Familiarize with DSP/BIOS Link and its various modules. Experiment with GPP-DSP communication and compilation processes to identify potential issues and useful features. Review existing RPC protocols and create one suitable for DSP-GPP communication over DSP/BIOS Link.

Status and Accomplishments

My exchange period in Sweden ended and I had to move back to Turkey, so I couldn't really get as much work done as I'd have liked, but now I have access to my evil scientist laboratory headquarters :) I'll be based here for the rest of the summer.
I now have an RPC implementation for DSP->GPP calls, composed of four parts: DSP-side function stubs, DSP-side marshaller/unmarshaller and transport functions, GPP side marshaller/unmarshaller and transport functions, and GPP side stubs/invokers. No dynamic loading on the GPP side, the invocation is a simple static jump table and there's no support for pointer parameters but it works fine otherwise.
had some time to play around with kernel module loading. didn't really discover anything groundbreaking, but noticed that C6Run doesn't work with the *.ko's that came with my Ångström. probably version differences which won't be an issue if one builds C6Run libraries and the distro itself using OpenEmbedded, but in the other case we could have a config file on the Beagle which points to the appropriate module filenames, and load these ones at startup instead.

Plans and Tasks

right now the stubs are doing most of the work for marshalling, which will make it difficult to add new RPC functions in the future. Have a marshaller that can pack a variable number of parameters into the message; I've already had some success with this. I'm planning to have a string "function signature" for each stub, and feed this along with all the parameters to the marshaller. Bad idea? Should each stub keep doing its own packing to increase efficiency?
passing pointer parameters is still a big issue due to memory sharing, we have to make sure that any direct memory pointers are mutually accessible after address translation, look into this.
finalize and commit my initial group of DSP-side wrappers with integer parameters and return types (mostly is*() functions from ctype.h)

Risks, issues, blockers

Memory issues still here: C6RunLib can utilize the CMEM interface to get DSP-GPP shareable contiguous memory regions, but there is no CMEM on the DSP side - what'll be the cure for this? force allocating everything from POOL buffers if they are to be passed as RPC parameters? set the DSP linker parameters to allocate the heap from a pre-determined shared region and do the address translation manually? copy and back-copy the DSP-side buffers manually into/out of the message buffers during marshalling and unmarshalling (but how do we know the size of a void* buffer, for example?)?

Wednesday, June 9, 2010

C6RunApp and DSP-RPC-POSIX

Now that we've looked into how C6RunApp works in general, it's time to bring DSP-RPC-POSIX into the picture. Although the project builds largely on top of C6RunApp, there are two keywords in its name which hint at a different direction:

RPC: the project aims to bring a fully functional RPC (Remote Procedure Call) framework into the picture in the sense that GPP-side procedures are remote procedures when viewed from the DSP side. In other words, the DSP will be able to call any GPP-side function using this component. Note that there already exists a limited form of RPC in the existing C6RunApp structure, but only for the file system access calls.
POSIX: using the RPC component, the GPP-side POSIX library will be made accessible from the DSP side.

At this point, the question "but the RTS lib already provides POSIX functionality! Why's an RPC POSIX layer necessary?" may come up. These were my primary reasons for going in this direction while writing my project proposal:

being able to offer the already stable GPP-side POSIX libraries with no extra effort required
have greater flexibility for expansion since it can eventually be used to call any GPP-side function, such as writing to the frame buffer or user-defined functions.
although the DSP is indeed a powerful processor, it is not meant for general computing, so it's not practical for tasks such as the string processing involved in formatting printf strings; this is a task better done by the GPP

Right now, I'm still experimenting with RPC-related tasks such as packing ("marshalling") and extracting ("unmarshalling") a variable number of arguments into/out of messages and dynamic loading, but the first wrappers should be appearing quite soon :)

C6RunApp

In my blog post entitled (rather appropriately, in my opinion :)) Moment of Truth #1, I had briefly mentioned what C6RunApp allows you to do - you write C code in the conventional way you would for the ARM (including things like debug outputs with printf or data input with scanf), and then use the C6RunApp script to get an ARM executable which actually runs everything you wrote on the DSP (except things that require access to the ARM side itself, but we'll get to that in a moment).

First, let's mention the DSP RTS library, as it's a relevant component that C6RunApp itself builds on. The TI CGT (which stands for Code Generation Tools, essentially the DSP side compiler, linker and other binary utilities) for the C6000 contains a library of standard C functions aptly called the Run Time Support (RTS) library. As you would expect, it contains implementations of commonly used C lib functions such as printf and scanf. But of course, the problem is not writing C lib function implementations on the DSP (which is quite a capable processor); it's the things that actually require GPP side capabilities, such as file system access (which in turn can also provide console input and output). The RTS implementations of these functions use a number of lower-level base functions (such as open, close, read and write) to carry out the needed GPP side communication, and these can be user-defined. For example, if one is using CCS (Code Composer Studio) for development, CCS has the driver which provides the communication between the host running the development environment and the DSP, so that you can see the terminal output and access local files.

Of course, this is not a very convenient scenario, as we are dependent on CCS for file system access. Why not just use the GPP side OS (that is, the Ångström distribution) instead? This is one of the underlying ideas for C6RunApp: we have a host application on the GPP side which receives requests over DSP/Link, performs the necessary file system calls, and passes the results back to the DSP. Another idea is that the GPP host app takes care of setting up DSP/Link and loads the DSP with the DSP-side executable without any effort from the user. Combining these two ideas, which give us "verbose" DSP side programs and abstract away the details of DSP/Link, we get easier DSP side development: we get C6RunApp!

C6RunApp Workflow

So what happens when you want to compile hello_world.c using the C6RunApp cross-compiler script? Let's have a look here first, and then we'll examine the involved libraries in some more detail. This is what the C6RunApp readme file has to say on the subject (with some small clarifications from me):

The DSP tools are used to build the supplied source and link it against the prebuilt C6RunApp DSP-side library. The result is the complete DSP-side executable image in standard TI COFF format.
The DSP executable image is minimized in size by using the symbol stripping tool, strip6x.
The contents of the stripped DSP executable file are converted to a C byte array in a temporary C header file. This header file is referenced by the main GPP-side C6RunApp loader, and thus the DSP executable image will be embedded into the final resulting GPP executable.
The main C6RunApp loader program is built using the ARM cross compiler tools, including the DSP-side executable inside the binary ARM ELF executable. This ARM executable has the same name as specified on the command line of the C6RunApp cross-compiler script.
Once the GPP executable is run, it sets up DSP/Link, loads the DSP with the built-in DSP executable and initializes the needed communication channels (namely, constructing the GPP->DSP message queue and locating the DSP->GPP message queue).
The GPP executable waits for the DSP to send it file system call requests, performs the requested ones and sends back the results, until it receives a signal that it can terminate.
Teardown is performed on the DSP/Link setup and the DSP is cleanly shut down.

C6RunApp Components

Let's have a look at the pieces of which C6RunApp consists:

The DSP side library, C6RunApp_dsp.lib - the library which the user-provided code is linked against, contains entry and exit points for the DSP/BIOS, initializes the communication channels and starts running the user-defined main(). It provides implementations of writemsg and readmsg which the DSP RTS lib bases the low-level communications on. These implementations pass the requested function call to the GPP via the message queue and read back the result in the same manner.
The GPP side library, C6RunApp_gpp.lib - the library that contains the functions which serve the DSP's C I/O requests.
The GPP main object, C6RunApp_load.o - once the DSP side executable is created and turned into a header file, the binary object C6RunApp_load.o is linked against the GPP side library to create the final GPP executable
The kernel modules - not really components so much as dependencies. C6RunApp utilizes CMEM for the initial loading of the DSP executable, the LPM to do a clean shutdown of the DSP as is needed on OMAP3530s, and the DSP/Link module for some obscure purpose :)

Hopefully this will have provided some insight into how the magic of C6RunApp works - coming up next (but sooner this time!) is where my GSoC project DSP-RPC-POSIX fits in with all of this.

Monday, June 7, 2010

GSoC Weekly Report #3

Weekly Report #3

Submitted on 2010-06-07

Covers 2010-06-01 to 2010-06-07

Corresponding Draft Schedule Item:

Status and Accomplishments

more experimentation with DSP/Link APIs and general reading especially for the DSP side arch — I feel confident and comfortable enough to seriously start working
studied the C6RunLib source code for inspiration on RPC calls and got some — but I still have doubts on implementing the exact same system, see issues section below
experimented with some GPP-side tasks which can be of use for the RPC framework such as passing a variable number of parameters to functions and methods for dynamic and static linking to dynamic link libraries
did some basic RPC tests using the existing C6RunApp writemsg/readmsg architecture, all went well (nothing fancy — was mainly for the purpose of testing my understanding of how things work)
looked for existing statistics or similar documentation on which POSIX functions are the most commonly used, without any luck... decided to follow a C standard library reference and start with the simpler functions, see plans and tasks section below

Plans and Tasks

finalize and make a working draft of the RPC system — possibly without dynamic loading on the GPP side at this stage, which can be added later on
finalize blog post on C6RunApp architecture and how DSP-RPC-POSIX fits in — hoping that increasing awareness will stir more interest
write the first wrapper(s) for the DSP side, starting with noncomplex functions (in terms of having a constant number of basic non-pointer parameters), e.g. putchar()
experiment a little with possibilities of user-friendly features such as checking and auto loading of kernel modules at program startup and catching Ctrl-C signals to perform DSP side cleanup before exiting

Risks, issues, blockers

I want the RPC system to eventually work with any user-defined GPP side function (not just C I/O) without much hassle, and I believe dynamic library loading will be necessary for this (i.e., given the function name, parameter type list and library name, be able to locate the function and call it on the GPP side). I realize this can be done by preprocessing the source code and adding the function defs before compiling, but I'm curious whether it can be done at runtime as well. C6RunLib uses this preprocessing method (via perl scripts), but in that case we provide, and therefore have access to, the source code of all the possible RPC functions anyway.
Memory issues continued: is it really a good idea to do RPC GPP calls for functions that take pointers as parameters and modify the pointed-to data, like malloc() and memset()? They're already present and working in the DSP RTS libs, so is it necessary? If so, we could do address translation before calling the actual GPP side function; can we ensure that the memory is mutually accessible by the processors by doing all allocation from POOL buffers? The issue eventually extends to any function that takes a pointer parameter... I'd appreciate some broader-perspective views on this subject.

Wednesday, June 2, 2010

A bit more about DSP/Link

My original plan for this particular blog post was to do a write-up on what the OMAP3530 multicore architecture looks like, where the DSP sits in this picture, how DSP/Link comes into play and what it can do for us. But then I noticed that the document I learned most of these things from is an excellent source of information and doesn't really need a rewrite, so I'm just going to offer my salutations to the ETH PIXHAWK MAV project and invite you to read their excellent DSP/Link API guide :)

Coming up soon: how C6RunApp works, and how my project DSP-RPC-POSIX will fit in it to take things a bit further.