# Python for Beginners -- TD 5

[Fanny Ducel](https://fannyducel.github.io/teaching/) (CC BY-NC-SA)

Reminder (see [Intro](https://members.loria.fr/KFort/files/fichiers_cours/M1_Python4Beg_Intro_2024.pdf)):
* work in pairs (or alone)
* never the same pair
* the work should be finished in class (no homework)

* your names should appear in the file name:
    * TDX-YourNames.ipynb (sometimes, it will be .py)


* put your work on Arche under TDX (X being the TD number)
    
Do not forget to (it's part of your grade):
* comment your code (not everything, but the choices you make)
* test the extreme values (add tests to your code, then comment them out, so that we can see them)



# Introduction

For this week's lab, you will have 6 different exercises to do, all related to the same file. However, exercises 5 and 6 are optional, so it is totally okay if you don't have time to complete them.

Furthermore, the exercises are quite independent from one another, so if you're stuck on one, you can skip it and come back to it later (except for ex 5 and 6, which require you to be done with the previous exercises).

## Ex 1 - Let's have a look at a transcription file

For this week's lab, we will work with a ".trn" file: SBC055.trn (found on: https://linguistics.ucsb.edu/sites/default/files/sitefiles/research/SBC/SBC055.trn). Download it and put it in the same folder as this notebook.

It is a transcription from an audio recording, so it has some specificities. Let's have a first look at it and create some functions to get some statistics on what's inside.


### 1) Print the vocabulary = all the words of the file, without duplicates
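
If you are not sure where to start, here is a minimal sketch of the idea on a made-up sentence (not the lab file). It assumes that "words" are simply the pieces you get by cutting the text on whitespace, which is only one possible choice.

```python
# Made-up sentence, NOT taken from SBC055.trn
text = "the cat saw the other cat"

# split() with no argument cuts on any whitespace (spaces, tabs, newlines)
words = text.split()

# A set keeps each element only once, so the duplicates disappear
vocabulary = set(words)

# sorted() is only there to make the display order stable
print(sorted(vocabulary))
```

On the real file, you will first have to read its content into a string.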

In [None]:
# TODO!

You will notice that there are lots of numbers (= timestamps) and other unwanted characters (they represent linguistic/phonetic information that we don't need for this lab). Don't worry, we will take care of these in ex 2.

### 2) Create a function that returns the number of words present in the file. It should return both the number of total words and the number of unique words (= no duplicates!)

Optional: You could also create a first function to read the file (it takes the path of the file as input and returns a string with the content of the file as output), and use it inside the function that returns the numbers of words. If you create it, you can re-use it in the next exercises as well!
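
A possible shape for such a reading function is sketched below. The name `read_file`, the UTF-8 encoding and the throwaway `demo.txt` file are assumptions made for the illustration; adapt them to your own choices.

```python
import os
import tempfile


def read_file(path):
    """Return the whole content of the file at `path` as one string."""
    # encoding="utf-8" is an assumption about the file, change it if needed
    with open(path, encoding="utf-8") as f:
        return f.read()


# Try the function on a tiny throwaway file so that the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("hello hello world")
    content = read_file(demo_path)

words = content.split()
# Total number of words, then number of unique words
print(len(words), len(set(words)))
```

A function can return both numbers at once with `return total, unique`, which gives back a tuple.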

In [None]:
# TODO!

### 3) Create a dictionary to see the vocabulary with numbers of occurrences

Example:
`{'good': 6, 'and': 55, ...}`

It indicates that the word 'good' is in the text 6 times, the word 'and' is in the text 55 times, ...
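
One common way to build such a dictionary is to go through the words one by one and add 1 to a counter. The sketch below does it on a made-up sentence; the variable names are only suggestions.

```python
# Made-up sentence, NOT taken from SBC055.trn
words = "good and bad and good and ugly".split()

occurrences = {}
for word in words:
    # get(word, 0) gives 0 the first time we meet a word,
    # and the current count every other time
    occurrences[word] = occurrences.get(word, 0) + 1

print(occurrences)
```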

In [None]:
# TODO!

### 4) Create a function that looks for a word and returns the number of times it is present in the file (the word we look for should be a parameter of the function). 

Example:
`count_word("good", transcription)` returns 6
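
Be careful with what "present in the file" means: counting inside the raw string also counts the word when it is hidden inside a longer word. The sketch below shows the difference on a made-up sentence, and also what happens for a word that never appears (an extreme value worth testing).

```python
# Made-up sentence, NOT taken from SBC055.trn
transcription = "good morning and goodbye , good luck"

# str.count looks for the characters anywhere, even inside "goodbye"
substring_hits = transcription.count("good")

# list.count on the split text only counts whole words
word_hits = transcription.split().count("good")

# A word that is absent simply gives 0, no error is raised
missing = transcription.split().count("zebra")

print(substring_hits, word_hits, missing)
```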

In [None]:
# TODO!

## Ex 2 - Let's "clean" the file! 

As you probably noticed in ex 1, the file has unwanted information (= a lot of characters/numbers we don't need for this lab, but that are useful for linguists who work on oral language). 

For this exercise, you don't have to code, I did the job for you! However, I want you to try and understand my code in depth. Pay special attention to the lines with a #TODO, and write additional comments to explain what is going on (what the specific line of code does, how/why it works, what would happen if we remove it).

Even if you are not writing code for this exercise, please focus. Knowing how to read someone else's code is a fundamental skill as you will often have to work with other people, or to use other people's code for your own work.

*Tip: add `print` to better see and understand what is happening at a given line!*
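
For instance, you can isolate one step and print it on a couple of lines. The two lines below are invented to imitate the tab-separated layout described in the code comments; the real lines of SBC055.trn may look different.

```python
# Two made-up lines: timestamps, then speaker, then utterance, separated by tabs
lines = [
    "0.00 1.50\tALICE:\tHello there\n",
    "1.50 2.00\t\tand welcome\n",  # the speaker field is empty here
]

kept = []
for line in lines:
    parts = line.split("\t")
    print(parts)       # every tab-separated field
    print(parts[-2:])  # only the last two fields
    kept.append(parts[-2:])
```

Printing before and after a given line is usually enough to see what that line changes.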

In [2]:
with open("SBC055.trn") as f:
    # read by line (so that it's clearer)
    file = f.readlines()
    
# Make a big list of all the cleaned lines
cleaned_file = []

for line in file:
    # timestamps, speaker and utterances are separated by tabs, so we use the tabulation character \t and split
    # to only keep the speaker's name and what they say
    cleaned_line = line.split("\t")[-2:]
    
    # We remove the non-alphabetical characters that we don't need in the utterances
    for char in [".", "\n", "[", "]", "@", "%", "<", ">", "(", ")", "=", "2", "3"]:
        cleaned_line[1] = cleaned_line[1].replace(char, "")
    
    # if the speaker's name is empty, we don't keep the empty string
    # Exercise: Try and understand these lines of code (explain it on the following line)
    # TODO: your explanation on how it is accomplished (understand the syntax) here
    if not cleaned_line[0]:
        cleaned_line.pop(0)
    
    # If there's no speaker's name, we add the utterance to the previous list (with the person's name)
    # Exercise: Try and understand these lines of code (explain it on the following line)
    # TODO: your explanation on how it is accomplished (understand the syntax) here
    if len(cleaned_line) < 2:
        cleaned_file[-1][-1] += " " + cleaned_line[0]

    else:
        #TODO: What does the following "if statement" do? What happens if you remove it?
        if cleaned_line[1]:
            cleaned_file.append(cleaned_line)

In [3]:
cleaned_file

[['MANY:', 'APPLAUSE'],
 ['PERROT:',
  " Good afternoon ladies and gentlemen Hx,  It's uh,  truly a historic occasion  today,  to have  among us,  a citizen who has contributed so much,  not only to this  entire area,  but t- to the world of ceramics,  and also to the world of literature,  and been part ,"],
 ['PERROT:',
  'of one of the great  artistic movements,  which  revolutionized,  the way  art was considered,  and which today still has,  extraordinary resonance,  in the work  of younger artists,  who are  rediscovering,  something that was discovered,  at the v- -- almost at the beginning of this century,  and with which  Miss Wood,  was so intimately involved'],
 ['WOOD:', ' P XXX P?'],
 ['PERROT:', ' To have  the Mother  of Dada  with us,'],
 ['MANY:', 'LAUGHTER_AND_APPLAUSE'],
 ['WOOD:', "X Let's keep X XX, Is this it"],
 ['PERROT:',
  "is in  deed a pleasure N- now, I would like to  hold forth, but  I don't think I'm going to be allowed to"],
 ['MANY:', ' LAUGHTER'],
 ['PER

## Ex 3 - Turn it into a function!

Paste the code from exercise 2 and turn it into a function. 

You should choose what to name your function, what parameters are required (if any) and what it should return. Also write the documentation for the function.
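
As a reminder of what the documentation of a function can look like, here is a sketch on an unrelated, made-up function. `count_characters` and its parameter `ignore_spaces` are invented for the illustration and are not part of the lab.

```python
def count_characters(text, ignore_spaces=False):
    """Count the characters of a string.

    Parameters:
        text (str): the string to measure
        ignore_spaces (bool): if True, spaces are not counted

    Returns:
        int: the number of characters
    """
    if ignore_spaces:
        text = text.replace(" ", "")
    return len(text)


# help() displays the docstring written above
help(count_characters)
print(count_characters("a b"), count_characters("a b", ignore_spaces=True))
```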

In [None]:
# TODO

## Ex 4 - Who said what?

As we work with a transcription file, there are different speakers, each producing many utterances. We want to use this information to try and see who said what, and who spoke the most.

In order to do that, we will create a dictionary. It should contain what each person said. So the keys will be the speakers' names, and each value will be the list of utterances of that speaker.

Work from the output produced by exercise 3 (or exercise 2, as the outputs should be the same for both exercises) and create this dictionary!

Example: 
```
{'Student_A': ['I have a question',
               'Okay, thank you',
               "I don't understand ex 4"],
 'Student_B': ["I'm tired", 'Let me help you'],
 'Teacher': ['Hello', "Let's work"]}
```
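
One way to get from a list of `[speaker, utterance]` pairs to this kind of dictionary is sketched below, on made-up pairs matching the example. It assumes that your input has the same shape as the output of exercise 2.

```python
# Made-up pairs, in the same shape as the cleaned file: [speaker, utterance]
pairs = [
    ["Teacher", "Hello"],
    ["Student_A", "I have a question"],
    ["Teacher", "Let's work"],
    ["Student_B", "I'm tired"],
]

said = {}
for speaker, utterance in pairs:
    # The first time we meet a speaker, we start with an empty list
    if speaker not in said:
        said[speaker] = []
    said[speaker].append(utterance)

print(said)
```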

In [None]:
# TODO

#### Don't forget to answer the question: who spoke the most? 

Use the dictionary you just created to find the answer!

If you couldn't do it, just create a dictionary manually with the name of the speakers and random utterances (or use the one provided as an example for the previous question).
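
A minimal sketch on the example dictionary, assuming that "spoke the most" means "produced the most utterances" (counting words instead would be another valid choice):

```python
# The example dictionary from the previous question
said = {
    "Student_A": ["I have a question", "Okay, thank you", "I don't understand ex 4"],
    "Student_B": ["I'm tired", "Let me help you"],
    "Teacher": ["Hello", "Let's work"],
}

# Number of utterances for each speaker
counts = {}
for speaker in said:
    counts[speaker] = len(said[speaker])

# max() goes through the keys and compares them using their counts
most_talkative = max(counts, key=counts.get)
print(most_talkative, counts[most_talkative])
```

If two speakers are tied, `max` returns the first one it meets, so you may want to handle that case yourself.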

In [None]:
# TODO

## Ex 5 - Compare this transcription to another one!

Choose another transcription file from https://linguistics.ucsb.edu/research/santa-barbara-corpus-spoken-american-english and download it (choose the "TRN" file!).
    
Run all the functions from previous exercises to compare the transcriptions. What differences can you observe? (Time to use your linguistics skills!)

In [None]:
# TODO

## Ex 6 - Visualize your data

Use the results from previous exercises to create some plots and visualize your results. You can decide what data you want to visualize. 

For example, you could plot the number of words from SBC055 and the number of words from the file you chose in ex 5 to compare the lexical diversity.

If you didn't do ex 5, you could plot the number of total words vs. unique words to see how lexically rich the utterances are. You could also plot the number of utterances per speaker, to give a more visual answer to the question "Who speaks the most?" from ex 4.

In [None]:
# TODO

# Example of how to use matplotlib (from TD2)
import matplotlib.pyplot as plt

xs = ["Karën", "Amandine", "Clémentine", "Fanny"]   # a list of names that will be used as X axis
ys = [10, 55, 66, 88]           # a list of numbers, corresponding to the xs list, to be used as Y axis

plt.bar(xs, ys)
plt.show()
# Make sure to close the figure once done
plt.close()