Large file parsing with Python

This week I came across an interesting problem. While parsing a big file with sed, I hit the error "Argument list too long". I knew that error, and sed was right: I was doing my job on a big file with long lines. I also had unnecessary loops in my code that I wanted to eliminate, maybe with a hash-like structure.

So I started looking for a solution. A friend suggested Perl's Tie::File module, but I was too lazy to install the needed RPMs on my machine; I also needed a hash-like structure (maybe Tie::File::AsHash solves that problem, whatever).

Then I came across Python's shelve module. That was my solution: parsing with a Python dictionary without memory concerns, because everything is kept in a DB-like file. Great!

Basically, the shelve module gives you a dictionary-like object to use in your code, and that dictionary is kept on the file system, not in memory.
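
To make that concrete, here is a minimal sketch of the dictionary-like interface. The shelf path and the sample keys are made up for illustration, and it assumes Python 3.4 or later, where a shelf can be used in a `with` block.

```python
import os
import shelve
import tempfile

# Hypothetical location for the shelf; any writable path works.
shelf_path = os.path.join(tempfile.mkdtemp(), "courses_shelf")

# The shelf behaves like a dict, but every value is pickled to disk.
with shelve.open(shelf_path) as db:
    db["alice"] = ["math"]
    # Values come back as copies: read, modify, then assign back
    # so that the change is actually written to the file.
    courses = db["alice"]
    courses.append("physics")
    db["alice"] = courses

# Reopen the shelf later: the data is still there.
with shelve.open(shelf_path) as db:
    stored = db["alice"]
    print(stored)  # ['math', 'physics']
```

Keys have to be strings, and values can be anything that pickle can handle.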

Enough talk, let's show some code. As an example, I will implement a basic version of the GNU join command to illustrate how shelve works.

Assume we have two files.

First one has COURSE_NAME | STUDENT_NAME1, STUDENT_NAME2
Second one has STUDENT_NAME | COURSE_STUDENT_TAKES

Here is our code
Example input/output will be like below.
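
A minimal sketch of such a join, under a few assumptions: it is not necessarily the code this post was written around, the join key is taken to be the student name, students missing from the first file are dropped (as GNU join does by default), and the function name, file names and sample data are all made up for illustration.

```python
import os
import shelve
import tempfile


def join_on_student(courses_path, students_path, shelf_path):
    """Yield 'STUDENT | COURSE | COURSES_FROM_FIRST_FILE' lines."""
    with shelve.open(shelf_path) as index:
        # Pass 1: index "COURSE | STUDENT1, STUDENT2" lines by student name.
        with open(courses_path) as courses_file:
            for line in courses_file:
                if "|" not in line:
                    continue  # skip blank or malformed lines
                course, students = line.split("|", 1)
                for student in students.split(","):
                    student = student.strip()
                    if student:
                        # Assign a new list so the shelf stores the update.
                        index[student] = index.get(student, []) + [course.strip()]

        # Pass 2: stream "STUDENT | COURSE" lines and look each student up.
        with open(students_path) as students_file:
            for line in students_file:
                if "|" not in line:
                    continue
                student, course = (part.strip() for part in line.split("|", 1))
                if student in index:
                    yield " | ".join([student, course, ", ".join(index[student])])


# Hypothetical sample input, written to a temporary directory.
workdir = tempfile.mkdtemp()
courses_path = os.path.join(workdir, "courses.txt")
students_path = os.path.join(workdir, "students.txt")
with open(courses_path, "w") as f:
    f.write("math | alice, bob\nphysics | alice\n")
with open(students_path, "w") as f:
    f.write("alice | chemistry\nbob | biology\ncarol | art\n")

joined = list(join_on_student(courses_path, students_path,
                              os.path.join(workdir, "index_shelf")))
for row in joined:
    print(row)
# alice | chemistry | math, physics
# bob | biology | math
```

Only the current line of each file is held in memory; the student index lives in the shelf file on disk.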

Happy hacking.

Edit 1:
It's really slow with big data, be careful :) In that case (IMHO) use a non-relational DB; MongoDB is a good one.

Installing Elasticsearch, Haystack and Django

We are using:

  • CentOS 5
  • Django 1.4.3
  • Haystack 2.0.0-beta 
  • Elasticsearch 0.20.1

Installation is straightforward:


Now let's create our Django project:

Now we should edit 'settings.py' in order to use Elasticsearch in Django:
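
A sketch of what the relevant part of 'settings.py' can look like with Haystack 2.0. It assumes Elasticsearch is listening on its default local port 9200; the index name 'haystack' is only an example, and the app list is shortened for illustration.

```python
# settings.py (fragment)

INSTALLED_APPS = (
    'django.contrib.contenttypes',
    # ... the rest of your apps ...
    'haystack',  # enables Haystack in the project
)

HAYSTACK_CONNECTIONS = {
    'default': {
        # Elasticsearch backend that ships with Haystack 2.0
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        # Assumption: Elasticsearch runs locally on the default port
        'URL': 'http://127.0.0.1:9200/',
        # Example index name; pick any name you like
        'INDEX_NAME': 'haystack',
    },
}
```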


And we are done with the installation part.

Senior Project

As our senior project, we are building a real-time log viewing and filtering utility. We will be using tools like Django and Elasticsearch. From now on I will try to write about what we did and how we did it with these tools. I think that by June 2013 (maybe earlier) I will be able to put our project on GitHub. That's all for now.