Knowing what to say is not always easy - especially if you're a chatbot.
Generating answers from scratch is very difficult and would most likely result in nonsense or worse, and certainly not in a pleasant user experience. We therefore take one step back and provide the correct replies ourselves, so that they "only" have to be chosen in the right dialog context.
In this project you're given a dataset of fictional dialogs (adapted from ), each with one reply missing, together with a list of all the missing replies. Your task is to map every missing reply to the correct conversation.
The dataset consists of 4 files. train_dialog.txt and test_dialog.txt contain the conversations. Each line has the format c##### (the conversation number), followed by the separator +++$+++ and the reply text. For example, one conversation from the training set looks like this:
c03253 +++$+++ Wow! This is like my dream room! Are these all records!
c03253 +++$+++ I have about fifteen hundred 78s at this point. I've tried to pare down my collection to the essential...
c03253 +++$+++ God, look at this poster! I can't believe this room! You're the luckiest guy in the world! I'd kill to have stuff like this!
c03253 +++$+++ Please... go ahead and kill me! This stuff doesn't make you happy, believe me.
c03253 +++$+++ You think it's healthy to obsessively collect things? You can't connect with other people so you fill your life with stuff... I'm just like all the rest of these pathetic collector losers.
All original conversations are at least four lines long, and the second-to-last line is always the one missing from the dialogs.
The missing replies are found in the files train_missing.txt and test_missing.txt, respectively. For the training dialogs, the conversation number is given with the reply, as in the dialog files; e.g. the missing line for the above conversation is
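The file format described above can be read with a few lines of Python. The following is only a minimal sketch: the function names and the UTF-8 default encoding are assumptions, not part of the task description, so adjust the encoding if the files turn out to use a different one.

```python
import io
from collections import OrderedDict

SEPARATOR = "+++$+++"


def parse_line(line):
    """Split one line into (conversation_number, reply_text)."""
    conv_id, _, text = line.rstrip("\n").partition(SEPARATOR)
    return conv_id.strip(), text.strip()


def group_by_conversation(lines):
    """Group reply texts by conversation number, preserving file order."""
    dialogs = OrderedDict()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        conv_id, text = parse_line(line)
        dialogs.setdefault(conv_id, []).append(text)
    return dialogs


def split_at_gap(lines):
    """Return (lines before the missing reply, the line after it).

    The missing reply is the second-to-last line of the original
    conversation, so in the incomplete dialog the gap sits right
    before the final line.
    """
    return lines[:-1], lines[-1]


def read_dialogs(path, encoding="utf-8"):
    """Read a dialog file; the encoding is an assumption."""
    with io.open(path, encoding=encoding) as handle:
        return group_by_conversation(handle)
```

Calling read_dialogs("train_dialog.txt") would then yield a mapping from each conversation number to its list of replies, and split_at_gap gives the context on either side of the missing line.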
c03253 +++$+++ Oh, come on! What are you talking about?
The missing lines for the test dialogs always have c00000 as the conversation number but are otherwise formatted like those in the training file. Although some short replies may have identical text, every missing reply belongs to exactly one conversation.
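Because every missing reply belongs to exactly one conversation, predictions can be constrained to a one-to-one assignment. The sketch below illustrates that idea with a greedy pairing. Both function names are hypothetical, and word_overlap is a deliberately naive placeholder score, not a recommended model.

```python
def word_overlap(reply, dialog_lines):
    """Placeholder score: Jaccard overlap of reply words and dialog words."""
    reply_words = set(reply.lower().split())
    dialog_words = set(" ".join(dialog_lines).lower().split())
    union = reply_words | dialog_words
    if not union:
        return 0.0
    return len(reply_words & dialog_words) / float(len(union))


def assign_one_to_one(replies, dialogs, score=word_overlap):
    """Greedily pair each reply with a distinct conversation number.

    replies: list of reply texts
    dialogs: dict mapping conversation number -> list of dialog lines
    Returns a list of conversation numbers aligned with `replies`.
    """
    candidates = []
    for index, reply in enumerate(replies):
        for conv_id, lines in dialogs.items():
            candidates.append((score(reply, lines), index, conv_id))
    # Highest score first; ties broken by reply index, then conversation number.
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))

    assigned = [None] * len(replies)
    used = set()
    for _, index, conv_id in candidates:
        if assigned[index] is None and conv_id not in used:
            assigned[index] = conv_id
            used.add(conv_id)
    return assigned
```

This version scores every reply against every dialog, so time and memory grow with the product of both counts. A greedy pairing is also not guaranteed to maximize the total score; an optimal assignment algorithm could be swapped in behind the same interface.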
Your task is now to take the missing test replies and map them to the corresponding dialogs. More specifically, you should write a script match_dialogs.py that can be called with the path to a file with the incomplete dialogs and the path to the missing replies. It should output a file test_missing_with_predictions.txt in the same format as test_missing.txt, but with the actual conversation numbers from test_dialog.txt instead of c00000.
You can choose whatever approach you want to solve the task. We only ask that you write your code in Python 2.7 and, if you use any external libraries, provide a requirements.txt file from which they can be installed with pip install -r requirements.txt (you might want to use a virtual environment and, when you're done, call pip freeze > requirements.txt).
While it is okay to use other resources such as pretrained word embeddings to solve the task, we ask you not to train your algorithm using the original conversations provided with  as this would lead to overfitting and is therefore considered cheating.
You should turn in all the code required to solve the task, i.e. everything that allows us to create the test_missing_with_predictions.txt file from the file without labels. Besides the accuracy of the predicted conversation labels, we will also evaluate your code with respect to efficiency, maintainability, and readability (it might not hurt to have a look at some style guides).
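One possible skeleton for the command-line side of match_dialogs.py is sketched below. The argument names, the UTF-8 encoding, and the predict placeholder are all assumptions made for illustration; only the two input paths and the output file name come from the description above.

```python
import argparse
import io

SEPARATOR = " +++$+++ "
OUTPUT_FILE = "test_missing_with_predictions.txt"


def format_predictions(replies, conv_ids):
    """Return output lines in the same format as test_missing.txt."""
    return [conv_id + SEPARATOR + reply
            for conv_id, reply in zip(conv_ids, replies)]


def predict(dialog_path, replies):
    """Hypothetical placeholder: plug the actual matching approach in here."""
    raise NotImplementedError("replace with your matching approach")


def build_parser():
    parser = argparse.ArgumentParser(
        description="Map missing replies to their dialogs.")
    parser.add_argument("dialog_path", help="file with the incomplete dialogs")
    parser.add_argument("missing_path", help="file with the missing replies")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Keep only the reply text; the c00000 placeholder number is dropped.
    with io.open(args.missing_path, encoding="utf-8") as handle:
        replies = [line.rstrip("\n").partition("+++$+++")[2].strip()
                   for line in handle if line.strip()]
    conv_ids = predict(args.dialog_path, replies)
    with io.open(OUTPUT_FILE, "w", encoding="utf-8") as handle:
        for line in format_predictions(replies, conv_ids):
            handle.write(line + u"\n")

# In the real script, call main() under the usual
# `if __name__ == "__main__":` guard.
```

With that guard in place, the script could be run as python match_dialogs.py test_dialog.txt test_missing.txt.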
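Since train_missing.txt carries the true conversation numbers, accuracy can be estimated before submitting by hiding those labels on a held-out part of the training data and comparing predictions against them. The helpers below are a minimal sketch with hypothetical names.

```python
def split_labels(missing_lines, separator="+++$+++"):
    """Separate labelled lines into (replies, true conversation numbers)."""
    replies, true_ids = [], []
    for line in missing_lines:
        if not line.strip():
            continue  # skip blank lines
        conv_id, _, text = line.rstrip("\n").partition(separator)
        true_ids.append(conv_id.strip())
        replies.append(text.strip())
    return replies, true_ids


def accuracy(predicted_ids, true_ids):
    """Fraction of replies mapped to the correct conversation number."""
    if len(predicted_ids) != len(true_ids):
        raise ValueError("prediction and label lists differ in length")
    if not true_ids:
        return 0.0
    correct = sum(1 for p, t in zip(predicted_ids, true_ids) if p == t)
    return correct / float(len(true_ids))
```

The held-out conversations should not be used for fitting or tuning, otherwise the estimate will be too optimistic.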
In addition to the code that solves the task, please turn in a text file or PDF with answers to the following questions:
Describe your approach. Which methods did you choose and why?
How do you evaluate your performance?
What are the weaknesses of your approach? What has to be considered when applying an approach like this in practice?