Published Jan 20, 2021
Nothing goes exactly as planned, but having an idea of what might change helps a lot. Nevertheless, we strive to hang in there, make every single day productive, and progress a little. Halfway through the internship is a good place to look back and re-estimate what lies ahead. According to my initial internship plan, by the halfway point I should have finished creating a data pipeline to fetch all Lua modules across wikis and performed the necessary data analysis to identify which modules are important to centralize in Abstract Wikipedia. Things went almost as planned, except there were many more barriers to overcome than I had imagined.
One of the reasons things stayed almost on track despite all the barriers is that I’m not alone in this project: two of us were selected for it, and that turned out to work out for the best for me. It allowed me to learn and engage more with git through issues, code reviews, and maintaining code readability and coding standards. Also, whenever I got stuck I could reach out to her directly, and we would get things fixed much faster over chat or a quick call.
During the contribution phase, I had gotten a taste of what I’d be dealing with, and I thought I could sweep through it easily. Except that this time I was working with all wikis, which means across languages and across projects like Wikipedia, Wiktionary, Commons, Wikivoyage, etc. Some of these I had not even heard of before, and I am pretty sure no one encounters them through ordinary Google searches. This meant dealing with a much larger amount of data. Of course, pandas and CSV were out the window, and we started using more SQL and self-created databases when fetching and saving data. The next most important set of issues was all the possible kinds of errors I had to solve in the scripts that fetch and save data. Memory errors led me to create generators and query the databases in parts, and then ‘connection lost’ and other database errors haunted me for a good two weeks, which I handled with try-except blocks and various maneuvering of database connection and cursor objects (a rough sketch of this follows below). Finally, when I started on the data analysis part of the project, the data was hard to get a view of: everything was heavily skewed, and nothing could be plotted except on a log scale. I also wanted to create some interactive plots to better play with the data and build nice dashboards, but with this amount of data it was simply not possible; the one or two interactive plots I attempted slowed the whole browser down and produced heavy files that took forever to load.
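Concretely, here is a minimal sketch of what that chunked fetching with reconnection might look like. It assumes a pymysql connection to a wiki replica database; the host name, table, and the `iter_rows` helper are illustrative, not the actual pipeline code.

```python
import time

import pymysql


def iter_rows(connect, query, params=(), chunk_size=10_000, max_retries=3):
    """Yield query results chunk by chunk so the full result set never has
    to sit in memory, reconnecting with backoff if the connection drops."""
    offset = 0
    while True:
        for attempt in range(max_retries):
            try:
                conn = connect()  # open a fresh connection for each chunk
                with conn.cursor() as cur:
                    cur.execute(f"{query} LIMIT %s OFFSET %s",
                                (*params, chunk_size, offset))
                    rows = cur.fetchall()
                conn.close()
                break
            except pymysql.err.OperationalError:  # e.g. "connection lost"
                time.sleep(2 ** attempt)          # back off, then retry
        else:
            raise RuntimeError(f"query still failing after {max_retries} tries")
        if not rows:
            return
        yield from rows
        offset += chunk_size


# Hypothetical usage: stream all pages in the Module namespace (828) of one wiki.
# connect = lambda: pymysql.connect(host="enwiki.analytics.db.svc.wikimedia.cloud",
#                                   database="enwiki_p",
#                                   read_default_file="~/replica.my.cnf")
# for page_id, title in iter_rows(
#         connect,
#         "SELECT page_id, page_title FROM page WHERE page_namespace = %s",
#         params=(828,)):
#     ...
```

Streaming the rows through a generator keeps memory use bounded to one chunk at a time, and opening a fresh connection per chunk means a dropped connection only costs a retry of that chunk rather than the whole run.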
Anyway, I have learned a lot through all of this and am glad these challenges came along. So far I have completed the data pipeline and performed my initial run of data analysis on the data I collected. For the next steps, I intend to document and share my findings, incorporate feedback, and move on to thinking about how to use machine learning to approach this problem (identifying important modules). Meanwhile, my project and internship partner is working on ways to measure similarity among module source code. Further down the line, our paths will cross and we will merge our work to suggest important modules that can be centralized in Abstract Wikipedia, along with which other modules’ code seems similar to the ones being suggested.