Monday, August 16, 2010

Final Weekly Report, #13

So here comes the final weekly report... I just want to say once again what a wonderful experience it has been to work with the BeagleBoard for GSoC! As I stated in last week's report, I believe I've managed to complete most of what my mentors and I had in mind for the end of the GSoC period. There's (as always!) more to do, of course, and I intend to continue development for C6Run in particular and the BeagleBoard in general, though with the start of the academic year it's unlikely that I'll be able to devote as much time. Right now I could really use a little break, and will be taking one :)


Weekly Report #13 
Submitted on 2010-08-16
Covers 2010-08-09 to 2010-08-16

Status and Accomplishments
• documentation moved to its own eLinux wiki page (at http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation), divided into two sections (Usage, Architecture) and expanded
• build system modifications: support auto-copying of the generated stub files by the stub generator utility, make a folder for user sources
• code cleanup: move GPP&DSP common definitions to a single header file, work on GPP-side pointer alignment problems, partition GPP-side code into modules

Wednesday, August 11, 2010

Documentation's moved

Just writing to let my readers know that the C6Run documentation has moved to the elinux wiki and will be residing there from now on:

http://elinux.org/BeagleBoard/GSoC/2010_Projects/C6Run/Documentation

I've been working on expanding and improving it yesterday, and there's more to come.

Monday, August 9, 2010

Weekly Report #12

Weekly Report #12
Submitted on 2010-08-09
Covers 2010-08-02 to 2010-08-09

Reflections
As we're past the soft deadline (aka "the suggested pencils down date" in GSoC speak), I wanted to write a bit about how DSP-RPC-POSIX has progressed so far. I'd say that the project is now in a state where it can do most of the things that my mentors and I had imagined it would be doing by now. C6Run's own goals of easy DSP app prototyping with console/file system access are pretty much accomplished, as is being easy to get started with (although there are debatable issues surrounding this particular goal, see below). DSP-RPC-POSIX's twofold goals, which were basically being able to call GPP-side functions from the DSP and providing a prebuilt set of POSIX (actually, "standard C library" would be a much more appropriate term) remote accessor functions, are also met. Right now, a sample DSP development cycle for a beginner developer using C6Run could look like:

• obtaining the sources from SVN
• setting up (which includes a script that downloads and sets up the C6Run dependencies) and building
• using C6RunApp and DSP-RPC-POSIX to construct, test and debug the wanted DSP-side algorithm
• using C6RunLib to wrap up the finalized DSP-side algorithm with an ARM library and use this library to build DSP-aided GPP apps

The specific capabilities and properties of DSP-RPC-POSIX include:

• ability to call most C standard library functions on the GPP side without any extra effort (stubs are already generated)
• ability to call any GPP-side function with support for all basic parameter types and buffers, and to obtain the return value, provided that stubs for the target function exist
• a stub generation tool to easily generate stubs for GPP-side functions (given a number of C source files, it will generate the stubs necessary for RPC access to all the functions contained)
• function signature system allows addition of new data types for parameters/return types with special treatment
• architecturally, the RPC layer is mostly separate from C6Run's other functionality so it could be taken apart and used in other projects
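To make the signature idea above a bit more concrete, here is a minimal sketch of signature-driven marshalling. It is not the actual DSP-RPC-POSIX code: the signature characters ('i', 'd', '@'), the rpc_msg type and the rpc_marshal function are all invented for illustration. The point is only that one character per parameter selects a packing rule, so supporting a new data type means adding a new character and a new case.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical message buffer; the real transport is not modelled here. */
typedef struct {
    unsigned char data[256];
    size_t        len;
} rpc_msg;

/* Append raw bytes to the message; returns 0 on success, -1 if it is full. */
static int msg_put(rpc_msg *m, const void *src, size_t n)
{
    if (n > sizeof m->data - m->len)
        return -1;
    memcpy(m->data + m->len, src, n);
    m->len += n;
    return 0;
}

/* Pack the arguments described by sig into m.
 * Invented signature characters (assumptions for this sketch):
 *   'i' = int, 'd' = double, '@' = buffer pointer followed by its length. */
int rpc_marshal(rpc_msg *m, const char *sig, ...)
{
    va_list ap;
    int rc = 0;

    assert(m != NULL && sig != NULL);
    m->len = 0;
    va_start(ap, sig);
    for (; *sig != '\0' && rc == 0; ++sig) {
        switch (*sig) {
        case 'i': {
            int32_t v = (int32_t)va_arg(ap, int);
            rc = msg_put(m, &v, sizeof v);
            break;
        }
        case 'd': {
            double v = va_arg(ap, double);
            rc = msg_put(m, &v, sizeof v);
            break;
        }
        case '@': {
            const void *buf = va_arg(ap, const void *);
            uint32_t n = (uint32_t)va_arg(ap, unsigned int);
            rc = msg_put(m, &n, sizeof n);      /* length prefix */
            if (rc == 0)
                rc = msg_put(m, buf, n);        /* buffer contents */
            break;
        }
        default:
            rc = -1;    /* no packing rule registered for this character */
            break;
        }
    }
    va_end(ap);
    return rc;
}
```

A call such as rpc_marshal(&m, "i@", 42, "hello", 6u) would pack an int and a 6-byte buffer; an unknown character is rejected instead of being guessed at.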

There are, of course, certain shortcomings:

• C6Run still does not build as part of OE. The build system has been modified to skip rebuilding existing dependencies, but there are still problems building for the Beagle. Once these are fixed it won't take long to have an OE recipe in place. To my knowledge, there are folks at TI who are already working on this.
• pointer/buffer issues: double/triple/etc. pointers, pointers inside structs, pointers hidden inside buffers (i.e., any pointer that is not explicitly declared as a parameter) can't be handled automatically and need to be manually address-translated by the user.
• structs can't be passed as parameters since a different signature character would be needed for each; they need to be passed as pointers to structs instead
• pointers/buffers to be passed via RPC need to be either less than a fixed number of bytes, or be allocated using rpc_malloc instead of from the DSP heap or stack
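As an illustration of what "manually address-translated" means for a pointer hidden inside a struct, here is a rough sketch of an offset-based translation between a physical address (as the DSP sees it) and a GPP virtual address. The shared_region type, the phys_to_virt function and the example addresses are hypothetical, and the sketch assumes the whole shared region is mapped contiguously into the GPP process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one shared region as seen from both sides. */
typedef struct {
    uintptr_t gpp_virt_base;  /* where the GPP process has the region mapped */
    uint32_t  phys_base;      /* physical base address, as used by the DSP */
    uint32_t  size;           /* region length in bytes */
} shared_region;

/* Physical -> GPP virtual. Returns 0 when the address is outside the
 * region (0 as a failure value is a simplification of this sketch). */
uintptr_t phys_to_virt(const shared_region *r, uint32_t phys)
{
    assert(r != NULL);
    if (phys < r->phys_base || phys - r->phys_base >= r->size)
        return 0;
    return r->gpp_virt_base + (phys - r->phys_base);
}

/* Example of a struct whose embedded pointer the stubs cannot see:
 * the GPP-side function has to translate 'samples' itself. */
typedef struct {
    uint32_t samples;   /* physical address of the sample buffer */
    uint32_t count;     /* number of samples */
} dsp_job;

/* Returns the GPP-usable address of job->samples, or 0 if it is not shared. */
uintptr_t job_samples_virt(const shared_region *r, const dsp_job *job)
{
    assert(job != NULL);
    return phys_to_virt(r, job->samples);
}
```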

Status and Accomplishments
• DSP-side caches are re-enabled for all platforms; stubs that use pointer/buffer parameters call the writeback/invalidate functions when needed to keep the caches coherent
• ARM-side cache coherence code and cache coherency testing example in place
• stubs for string.h functions and POSIX low level I/O functions

Plans and Tasks
• what Google suggests after the soft pencils down date - code scrubbing (mainly fixing possible memory alignment issues, moving commonly used things to GPP/DSP shared header files etc.), improving documentation and writing more tests/examples

Risks, issues, blockers
• for some reason (probably due to the absence of a BCACHE_wbAll function on DSP-side?) enabling ARM caches still results in cache coherency problems.
• the C6Run trunk still produces problematic executables for the BeagleBoard, so a merge with the dsp-rpc-posix branch doesn't make sense at this point, which means OE integration won't make it to the GSoC deadline

Wednesday, August 4, 2010

Cache Coherency

The presence and usage of ARM and DSP side caches poses a cache coherency problem when both processors need to access a shared memory area. Let's consider the following scenario:

1. A CMEM buffer is allocated for shared usage by the GPP side and its physical pointer is passed to the DSP.
2. The DSP wants to read and then write some data into this buffer. Let's say that there are free entry slots in the DSP L2 cache - so the data actually gets written to the DSP cache, instead of making it to the DDR shared region. The DSP then signals the GPP that it's done with the buffer for the time being.
3. The GPP attempts to read the buffer, but what it reads is just the garbage values present in the buffer after initialization, since the DSP-written data is in the DSP cache. This buffer also gets cached on the ARM side now, so when the GPP tries to write some new data into it, the data stays in the ARM cache and doesn't make it to the main memory either.
4. If the DSP tries to read the buffer now, it won't get what the ARM has written into it most recently since it'll be reading from its own cache, and vice versa for the ARM side.

We can see that the "same" buffer actually exists in three different locations (main memory, DSP cache, ARM cache), all of which can contain totally different data - in this case it is said that they are not coherent, and that we have a cache coherency problem.
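The scenario can be replayed with a toy model. This is only a sketch: the "buffer" is a single int, each processor gets a one-entry write-back cache, and all the names (cache_line, cached_read, cached_write, replay_scenario) are made up for illustration. Real cache allocation policies are more involved than this.

```c
#include <assert.h>
#include <stdbool.h>

/* One cached copy of the one-word "buffer". */
typedef struct {
    int  value;
    bool valid;   /* the cache holds a copy */
    bool dirty;   /* the copy is newer than main memory */
} cache_line;

/* Read through a cache: fill from main memory on a miss, else use the copy. */
static int cached_read(const int *ddr, cache_line *c)
{
    if (!c->valid) {
        c->value = *ddr;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Write-back behaviour: the write lands in the cache only. */
static void cached_write(cache_line *c, int v)
{
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Replays steps 2 to 4 and reports what each location ends up holding. */
void replay_scenario(int *dsp_sees, int *arm_sees, int *ddr_holds)
{
    int ddr = -1;                       /* -1 stands in for garbage */
    cache_line dsp = {0}, arm = {0};

    (void)cached_read(&ddr, &dsp);      /* step 2: DSP reads the buffer ... */
    cached_write(&dsp, 111);            /* ... and writes; stays in DSP cache */

    (void)cached_read(&ddr, &arm);      /* step 3: GPP reads garbage from DDR */
    cached_write(&arm, 222);            /* GPP write stays in the ARM cache */

    *dsp_sees  = cached_read(&ddr, &dsp);   /* step 4: DSP reads its own copy */
    *arm_sees  = cached_read(&ddr, &arm);
    *ddr_holds = ddr;
}
```

After the replay, the DSP sees 111, the ARM sees 222 and main memory still holds the initial garbage: three different views of the "same" buffer.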

In most cached systems, there are cache coherency protocols which prevent these situations from occurring. Here, however, the TMS320C64x+ DSP Cache User's Guide states:

In the following cases, it is your responsibility to maintain cache coherence:
• DMA or other external entity writes data or code to external memory that is then read by the CPU
• CPU writes data to external memory that is then read by DMA or another external entity

Thus we have to manually maintain cache coherence for mutual access to CMEM regions by the DSP and the GPP. Studying the scenario above, we can observe that there are two underlying problems:

1. If the memory block to be read already exists in the local cache, there's a risk that the local cache is outdated: we need to discard the local cache entries and fill them up with information from the main memory. This process is called cache invalidation.
2. When the memory block is to be written into, there's a risk that the info remains in the local cache and doesn't make it to the main memory: we have to make sure that the new info gets written to the main memory as well. This process is called cache writeback.

Therefore, from a RPC perspective, for a call that involves transferring buffers, the steps we have to take are as follows:

1. Before passing the marshalled info via DSP/Link, the DSP must do a cache writeback
2. Before passing the params to the GPP side stub, the GPP must do a cache invalidate
3. After the GPP side stub is finished, the GPP must do a cache writeback
4. The DSP side stub must do a cache invalidate before terminating

This is assuming that both processor caches will be active - in case the DSP cache is disabled, the steps 1 and 4 will not be necessary, and likewise with steps 2 and 3 for a disabled GPP cache.
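The four steps can be walked through with a toy model as well. Again this is just a sketch with invented names (cache_wb, cache_inv, rpc_double): the buffer is a single int, each side has a one-entry write-back cache, and the GPP-side "function" simply doubles the value. The real writeback/invalidate calls operate on address ranges and are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int  value;
    bool valid;   /* the cache holds a copy */
    bool dirty;   /* the copy is newer than main memory */
} cache_line;

/* Read through a cache: fill from main memory on a miss. */
static int cached_read(const int *ddr, cache_line *c)
{
    if (!c->valid) {
        c->value = *ddr;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Write-back behaviour: the write lands in the cache only. */
static void cached_write(cache_line *c, int v)
{
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Writeback: push a dirty copy out to main memory. */
static void cache_wb(int *ddr, cache_line *c)
{
    if (c->valid && c->dirty) {
        *ddr = c->value;
        c->dirty = false;
    }
}

/* Invalidate: drop the cached copy so that the next read refetches. */
static void cache_inv(cache_line *c)
{
    c->valid = false;
    c->dirty = false;
}

/* One RPC round trip in which the GPP-side function doubles the buffer. */
int rpc_double(int input)
{
    int ddr = -1;                       /* main memory, initially garbage */
    cache_line dsp = {0}, arm = {0};

    cached_write(&dsp, input);          /* DSP fills the buffer */
    cache_wb(&ddr, &dsp);               /* step 1: DSP writeback */

    cache_inv(&arm);                    /* step 2: GPP invalidate */
    int v = cached_read(&ddr, &arm);
    cached_write(&arm, 2 * v);          /* GPP-side function runs */
    cache_wb(&ddr, &arm);               /* step 3: GPP writeback */

    cache_inv(&dsp);                    /* step 4: DSP invalidate */
    return cached_read(&ddr, &dsp);     /* DSP now sees the GPP's result */
}
```

Dropping any one of the four calls in rpc_double makes the model return a stale or garbage value, which is a quick way to see why each step is needed.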

Monday, August 2, 2010

Weekly Report #11

Weekly Report #11
Submitted on 2010-08-02
Covers 2010-07-26 to 2010-08-02

Status and Accomplishments
• an elementary version of the mechanism for passing DSP-allocated (stack or heap) buffers to RPC functions is now in place. What essentially happens is that when the GPP side detects that a non-CMEM-allocated buffer has been passed as a parameter, it uses PROC_read to read from the DSP memory and copies the data into a regular GPP buffer which the functions can access.
• bugfixes for RPC functions returning structures - but pointer parameters inside structs still remain an issue and have to be translated manually
• dsp-rpc-posix branch re-synced with c6run trunk, but to no avail (see blockers section)
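The buffer handling described in the first item above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the address windows, the resolve_buffer function, and in particular fake_dsp_read, which stands in for a DSP memory read such as PROC_read and reads from a local array instead of real DSP memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared (CMEM) window, made up for this sketch. */
#define SHARED_BASE   0x87000000u
#define SHARED_SIZE   0x00100000u

/* Simulated DSP-local memory; a real system would read it via DSP/Link. */
#define FAKE_DSP_BASE 0x10000000u
unsigned char fake_dsp_mem[64];

/* Stand-in for a DSP memory read; returns 0 on success, -1 on a bad range. */
static int fake_dsp_read(uint32_t dsp_addr, uint32_t n, void *dst)
{
    if (dsp_addr < FAKE_DSP_BASE)
        return -1;
    uint32_t off = dsp_addr - FAKE_DSP_BASE;
    if (off > sizeof fake_dsp_mem || n > sizeof fake_dsp_mem - off)
        return -1;
    memcpy(dst, fake_dsp_mem + off, n);
    return 0;
}

/* True when [addr, addr + n) lies entirely inside the shared window. */
static int in_shared_window(uint32_t addr, uint32_t n)
{
    return addr >= SHARED_BASE && n <= SHARED_SIZE
        && addr - SHARED_BASE <= SHARED_SIZE - n;
}

/* Returns 0 if the buffer is already shared (no copy, *copy_out = NULL),
 * 1 if a malloc'd GPP copy was made (caller frees it), -1 on error. */
int resolve_buffer(uint32_t dsp_addr, uint32_t n, void **copy_out)
{
    assert(copy_out != NULL);
    *copy_out = NULL;
    if (n == 0)
        return -1;
    if (in_shared_window(dsp_addr, n))
        return 0;

    void *copy = malloc(n);
    if (copy == NULL)
        return -1;
    if (fake_dsp_read(dsp_addr, n, copy) != 0) {
        free(copy);
        return -1;
    }
    *copy_out = copy;
    return 1;
}
```

Note that a copy made this way is one-directional: anything the GPP-side function writes into it would still have to be copied back to the DSP, which the sketch does not attempt.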

Plans and Tasks
• the ARM and DSP caches are both disabled for dsp-rpc-posix to deal with cache coherency, but this impacts the memory access performance quite negatively - enable them again and use writeback/invalidate functions to keep caches in sync
• some POSIX functions are still missing from the RPC library since their corresponding header files (with type definitions and all) don't exist for the DSP side. This includes useful things like ioctls - find a way to expose these through RPC

Risks, issues, blockers
• despite all the time I spent on it, I couldn't uncover the cause of the "bus error" that occurs when I compile even the simplest things (such as hello world) with the latest C6Run trunk (so it's not just my synced branch that's troublesome). In some cases it's just "bus error" and the program stopping, in others PROC_setup fails. The code responsible for these parts hasn't really changed, so I suspect a build/configuration issue. Due to this and the fact that my primary computer's hard drive crashed (I'm stuck with a silly little netbook!) I haven't been able to try and build C6Run inside OE. I'm not sure whether this is just happening for me or the c6run trunk is broken (there are indeed some errors that prevent it from building, but they are small and easy to fix)