Data Science

  2. Content: • Introduction to Python • NumPy • Pandas | @Apptrainers
  4. “In December 1989, I was looking for a "hobby" programming project that would keep me occupied during the week around Christmas. My office ... would be closed, but I had a home computer, and not much else on my hands. I decided to write an interpreter for the new scripting language I had been thinking about lately: a descendant of ABC that would appeal to Unix/C hackers. I chose Python as a working title for the project, being in a slightly irreverent mood (and a big fan of Monty Python's Flying Circus).” (Guido van Rossum)
  5. The big technology companies have each largely aligned themselves with different language stacks.
      • Oracle and IBM are aligned with Java (Oracle owns Java).
      • Google is known for its use of Python (first released in 1991), a versatile, dynamic, and extensible language, although in practice it is also a heavy user of C++ and Java. Google also created its own language, Go (2009).
  6. Python is an easy-to-learn yet powerful programming language.
      • It has efficient high-level data structures and a simple but effective approach to object-oriented programming.
      • It is freely available in source or binary form for all major platforms from the Python website, https://www.python.org/
      • The Python interpreter is easily extended with new functions and data types implemented in C or C++ (or other languages callable from C).
      • Python is also suitable as an extension language for customizable applications.
      • It is widely used (Google, NASA, Quora).
  7. When you run a Python program, the interpreter executes it statement by statement (CPython first compiles the source to bytecode), whereas in compiled languages such as C or C++ a compiler translates the whole program to machine code before it starts running. As a result, interpreted languages are usually somewhat slower than compiled languages.
  8. In Python you do not need to declare a variable's data type ahead of time: a name simply refers to whatever value is assigned to it, and the type is determined at run time from that value.
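A minimal sketch of what dynamic typing looks like in practice. The variable names (`value`, `first_type`, and so on) are chosen only for this illustration.

```python
# The same name can be bound to values of different types over time;
# the type belongs to the value, not to the name.
value = 42
first_type = type(value).__name__     # 'int'

value = "forty-two"
second_type = type(value).__name__    # 'str'

value = 3.14
third_type = type(value).__name__     # 'float'

print(first_type, second_type, third_type)
```

The built-in `type` function is a quick way to check what a name currently refers to.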
  9. Python programs are often cited as being one-third to one-fifth the length of equivalent Java programs, so the same task usually takes less code in Python than in Java.
  10. There are many good options for writing and editing code:
      • Sublime Text (unlimited free trial available)
      • Notepad++
      • Xcode (Mac)
      • TextWrangler (Mac)
      • TextEdit (Mac)
      There are also several platforms offering free online courses:
      • Coursera
      • edX
      • Stanford Online
      • Khan Academy
      • Udacity
  11. To download Python, follow the instructions on the official website: https://www.python.org/
  12. I would strongly recommend this video: https://www.youtube.com/watch?v=HW29067qVWk 12| @Apptrainers
  14. https://git-scm.com/book/en/v2/Getting-Started-Installing-Git https://github.com 14| @Apptrainers
  15. "GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere." GitHub repositories can be public or private (private repositories used to require a paid plan). A repository is usually used to organize a single project. It contains folders and files, images, videos, spreadsheets, and data sets: anything your project needs.
  16. Key GitHub concepts:
      • Master: the main branch of a repository, holding the final version.
      • Branch: a place to try out new ideas without affecting master until a pull request is accepted. Changes committed to a branch are recorded there, so you can keep track of different versions.
      • Commit: a saved change that keeps a history of progress on a branch or on master.
      • Fork: a copy of a repository under your own account. You submit a pull request to the owner so that the owner can incorporate your changes.
  17. Assignment:
      • Download Python and Jupyter Notebook.
      • Write Python code to print your name, your ID, and your favorite quote.
      • Save the project as .html and as .ipynb.
      • Install git and create a GitHub account.
      • Upload your first project as .html to e-learning.
      • Upload your first project as .ipynb to your GitHub account.
      • Share the link to your GitHub account with me on e-learning.
  18. https://www.tutorialspoint.com/execute_python_online.php https://www.onlinegdb.com/online_python_compiler 18| @Apptrainers
  19. You can type things directly into a running Python session 19| @Apptrainers
  20. Most programming languages, such as C, C++, and Java, use braces { } to define a block of code. Python uses indentation. A code block (the body of a function, loop, etc.) starts with indentation and ends with the first unindented line. The amount of indentation is up to you, but it must be consistent throughout that block. Generally four spaces are used for indentation, and spaces are preferred over tabs. Here is an example:
          for i in range(1, 11):
              print(i)
              if i == 5:
                  break
      Incorrect indentation results in an IndentationError.
  21. In Python, the hash (#) symbol starts a comment, which extends to the end of the line. Comments help programmers understand a program; the Python interpreter ignores them.
          # This is a comment
          # print out Hello
          print('Hello')
      If a comment extends over multiple lines, one way to write it is to put a hash (#) at the beginning of each line. Another way is to use triple quotes, either ''' or """. Triple quotes are generally used for multi-line strings, but they can serve as multi-line comments as well.
          """This is also a
          perfect example of
          multi-line comments"""
  22. expression: a data value or a set of operations that computes a value.
      Examples: 1 + 4 * 3        42
      Arithmetic operators we will use:
          + - * /    addition, subtraction, multiplication, division
          %          modulus, a.k.a. remainder
          **         exponentiation
      precedence: the order in which operations are computed.
          * / % ** have a higher precedence than + -
          1 + 3 * 4 is 13
      Parentheses can be used to force a certain order of evaluation.
          (1 + 3) * 4 is 16
      Operator   Description      Example
      =          Assignment       num = 7
      +          Addition         num = 2 + 2
      -          Subtraction      num = 6 - 4
      *          Multiplication   num = 5 * 4
      /          Division         num = 25 / 5
      %          Modulo           num = 8 % 3
      **         Exponent         num = 9 ** 2
  23. Integer division: in Python 3, / always produces a float (35 / 5 is 7.0), and the // operator gives the integer quotient. (In Python 2, / between two integers already behaved like //.)
          35 // 5 is 7
          84 // 10 is 8
          156 // 100 is 1
      The % operator computes the remainder from a division of integers.
      The operators + - * / % ** ( ) all work for real numbers.
      • / produces an exact answer: 15.0 / 2.0 is 7.5
      • The same rules of precedence also apply to real numbers: evaluate ( ) before * / % before + -
      • When integers and reals are mixed, the result is a real number. Example: 1 / 2.0 is 0.5
      The conversion occurs on a per-operator basis:
          7 // 3 * 1.2 + 3 // 2
          2 * 1.2 + 3 // 2
          2.4 + 3 // 2
          2.4 + 1
          3.4
  24. Python has useful commands for performing calculations.
      Command name           Description
      abs(value)             absolute value
      ceil(value)            rounds up
      cos(value)             cosine, in radians
      floor(value)           rounds down
      log(value)             logarithm, base e
      log10(value)           logarithm, base 10
      max(value1, value2)    larger of two values
      min(value1, value2)    smaller of two values
      round(value)           nearest whole number
      sin(value)             sine, in radians
      sqrt(value)            square root
      Constant   Description
      e          2.7182818...
      pi         3.1415926...
      abs, max, min, and round are built in. To use the other commands, you must write the following at the top of your Python program:
          from math import *
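As a small illustration of the commands in the table, the sketch below uses `import math` with the `math.` prefix instead of `from math import *`. That is a stylistic choice made here so it is clear where each function comes from, and the variable names are only illustrative.

```python
import math

root = math.sqrt(16)          # square root -> 4.0
up = math.ceil(2.3)           # rounds up -> 3
down = math.floor(2.7)        # rounds down -> 2
log_base_10 = math.log10(100) # logarithm, base 10 -> 2.0
largest = max(3, 9)           # built in, no import needed -> 9
magnitude = abs(-7)           # built in, no import needed -> 7

print(root, up, down, log_base_10, largest, magnitude)
print(round(math.pi, 4))      # pi rounded to 4 decimal places -> 3.1416
```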
  25. variable: a named piece of memory that can store a value.
      Usage:
      • compute an expression's result,
      • store that result in a variable,
      • and use that variable later in the program.
      assignment statement: stores a value into a variable.
      Syntax: name = value
      Examples:
          x = 5          # x now holds 5
          gpa = 3.14     # gpa now holds 3.14
      A variable that has been given a value can be used in expressions.
          x + 4 is 9
      Exercise: Evaluate the quadratic equation for a given a, b, and c.
  26. print: produces text output on the console.
      Syntax:
          print("Message")
          print(Expression)
      Prints the given text message or expression value on the console, and moves the cursor down to the next line.
          print(Item1, Item2, ..., ItemN)
      Prints several messages and/or expressions on the same line.
      Examples:
          print("Hello, world!")
          age = 45
          print("You have", 65 - age, "years until retirement")
      Output:
          Hello, world!
          You have 20 years until retirement
  27. input: reads a line of text typed by the user. In Python 3 the result is always a string, so convert it with int() or float() when you need a number.
      You can assign (store) the result of input into a variable.
      Example:
          age = int(input("How old are you? "))
          print("Your age is", age)
          print("You have", 65 - age, "years until retirement")
      Output:
          How old are you? 53
          Your age is 53
          You have 12 years until retirement
      Exercise: Write a Python program that prompts the user for his/her amount of money, then reports how many Nintendo Wiis the person can afford, and how much more money he/she will need to afford an additional Wii.
  28. for loop: repeats a set of statements over a group of values.
      Syntax:
          for variableName in groupOfValues:
              statements
      • We indent the statements to be repeated with tabs or spaces.
      • variableName gives a name to each value, so you can refer to it in the statements.
      • groupOfValues can be a range of integers, specified with the range function.
      Example:
          for x in range(1, 6):
              print(x, "squared is", x * x)
      Output:
          1 squared is 1
          2 squared is 4
          3 squared is 9
          4 squared is 16
          5 squared is 25
  30. The range function specifies a range of integers:
      • range(start, stop): the integers between start (inclusive) and stop (exclusive)
      It can also accept a third value specifying the change between values.
      • range(start, stop, step): the integers between start (inclusive) and stop (exclusive) by step
      Example:
          for x in range(5, 0, -1):
              print(x)
          print("Hello!")
      Output:
          5
          4
          3
          2
          1
          Hello!
  31. Some loops incrementally compute a value that is initialized outside the loop. This is sometimes called a cumulative sum. (The variable is named total here because sum is also the name of a built-in function.)
          total = 0
          for i in range(1, 11):
              total = total + (i * i)
          print("sum of first 10 squares is", total)
      Output:
          sum of first 10 squares is 385
      Exercise: Write a Python program that computes the factorial of an integer.
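One possible solution to the factorial exercise, written as a cumulative product in the same style as the cumulative sum above. The function name `factorial` and its handling of 0 are choices made for this sketch.

```python
def factorial(n):
    """Return n! for a non-negative integer n using a cumulative product."""
    result = 1                     # initialized outside the loop
    for i in range(1, n + 1):      # 1, 2, ..., n
        result = result * i
    return result

print(factorial(5))    # 120
print(factorial(0))    # 1 (the loop body never runs)
```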
  32. if statement: executes a group of statements only if a certain condition is true. Otherwise, the statements are skipped.
      Syntax:
          if condition:
              statements
      Example:
          gpa = 3.4
          if gpa > 2.0:
              print("Your application is accepted.")
  33. if/else statement: executes one block of statements if a certain condition is True, and a second block of statements if it is False.
      Syntax:
          if condition:
              statements
          else:
              statements
      Example:
          gpa = 1.4
          if gpa > 2.0:
              print("Welcome to JUST University!")
          else:
              print("Your application is denied.")
      Multiple conditions can be chained with elif ("else if"):
          if condition:
              statements
          elif condition:
              statements
          else:
              statements
  34. while loop: executes a group of statements as long as a condition is True. It is good for indefinite loops (repeat an unknown number of times).
      Syntax:
          while condition:
              statements
      Example:
          number = 1
          while number < 200:
              print(number, end=" ")
              number = number * 2
      Output:
          1 2 4 8 16 32 64 128
  35. Many logical expressions use relational operators:
      Operator   Meaning                    Example       Result
      ==         equals                     1 + 1 == 2    True
      !=         does not equal             3.2 != 2.5    True
      <          less than                  10 < 5        False
      >          greater than               10 > 5        True
      <=         less than or equal to      126 <= 100    False
      >=         greater than or equal to   5.0 >= 5.0    True
      Logical expressions can be combined with logical operators:
      Operator   Example              Result
      and        9 != 6 and 2 < 3     True
      or         2 == 3 or -1 < 5     True
      not        not 7 > 0            False
      Exercise: Write code to display and count the factors of a number.
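A possible solution to the factors exercise, combining a while loop, the % operator, and a relational test. The function name `factors` and the sample number 12 are assumptions made for this sketch.

```python
def factors(n):
    """Return the list of positive divisors of n, for an integer n >= 1."""
    found = []
    candidate = 1
    while candidate <= n:
        if n % candidate == 0:     # remainder 0 means candidate divides n
            found.append(candidate)
        candidate = candidate + 1
    return found

divisors = factors(12)
print(divisors)                    # [1, 2, 3, 4, 6, 12]
print("count:", len(divisors))     # count: 6
```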
  36. string: a sequence of text characters in a program.
      • Strings start and end with quotation mark " or apostrophe ' characters.
      • Examples:
          "hello"
          "This is a string"
          "This, too, is a string. It can be very long!"
      • A string written this way may not span multiple lines or contain an unescaped " character.
          "This is not
          a legal String."
          "This is not a "legal" String either."
      • A string can represent special characters by preceding them with a backslash.
          \t   tab character
          \n   new line character
          \"   quotation mark character
          \\   backslash character
      • Example: "Hello\tthere\nHow are you?"
  37. Characters in a string are numbered with indexes starting at 0.
      Example:
          name = "P. Diddy"
          index       0   1   2   3   4   5   6   7
          character   P   .       D   i   d   d   y      (index 2 is a space)
      Accessing an individual character of a string: variableName[index]
      Example:
          print(name, "starts with", name[0])
      Output:
          P. Diddy starts with P
  38. len(string): number of characters in a string (including spaces)
      str.lower(string): lowercase version of a string
      str.upper(string): uppercase version of a string
      Example:
          name = "Martin Douglas Stepp"
          length = len(name)
          big_name = str.upper(name)    # more commonly written name.upper()
          print(big_name, "has", length, "characters")
      Output:
          MARTIN DOUGLAS STEPP has 20 characters
  39. A list is a compound data type:
          [0]
          [2.3, 4.5]
          [5, "Hello", "there", 9.8]
          []
      Use len() to get the length of a list:
          >>> names = ["Ben", "Chen", "Yaqin"]
          >>> len(names)
          3
  42. http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html 42| @Apptrainers
  43. Certain features of Python are not loaded by default. To use them, you need to import the modules that contain them, e.g.:
          import matplotlib.pyplot as plt
          import numpy as np
  44. Division in Python 2 and Python 3:
          f = 7 / 2           # in Python 2, f is 3 unless you write "from __future__ import division"
          f = 7 / 2           # in Python 3, f is 3.5
          f = 7 // 2          # f is 3 in both Python 2 and 3
          f = 7 / 2.          # f is 3.5 in both Python 2 and 3
          f = 7 / float(2)    # f is 3.5 in both Python 2 and 3
          f = int(7 / 2)      # f is 3 in both Python 2 and 3
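The sketch below assumes Python 3 and adds one detail the comparison above does not show: `//` and `int(...)` agree for positive operands but differ for negative ones, because `//` rounds toward negative infinity while `int()` truncates toward zero. The variable names are illustrative only.

```python
true_quotient = 7 / 2              # 3.5: true division in Python 3
floor_quotient = 7 // 2            # 3: floor division
negative_floor = -7 // 2           # -4: rounds toward negative infinity
negative_truncated = int(-7 / 2)   # -3: int() truncates toward zero

# divmod returns the floor quotient and the remainder together
quotient, remainder = divmod(7, 2) # (3, 1)

print(true_quotient, floor_quotient, negative_floor, negative_truncated)
print(quotient, remainder)
```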
  45. Get the i-th element of a list, and slice lists:
          x = [i for i in range(10)]    # is the list [0, 1, ..., 9]
          zero = x[0]                   # equals 0, lists are 0-indexed
          one = x[1]                    # equals 1
          nine = x[-1]                  # equals 9, 'Pythonic' for last element
          eight = x[-2]                 # equals 8, 'Pythonic' for next-to-last element
          one_to_four = x[1:5]          # [1, 2, 3, 4]
          first_three = x[:3]           # [0, 1, 2]
          last_three = x[-3:]           # [7, 8, 9]
          three_to_end = x[3:]          # [3, 4, ..., 9]
          without_first_and_last = x[1:-1]     # [1, 2, ..., 8]
          copy_of_x = x[:]                     # [0, 1, 2, ..., 9]
          another_copy_of_x = x[:3] + x[3:]    # [0, 1, 2, ..., 9]
  46. Membership, concatenation, and unpacking:
          1 in [1, 2, 3]    # True
          0 in [1, 2, 3]    # False

          x = [1, 2, 3]
          y = [4, 5, 6]
          x.extend(y)       # x is now [1,2,3,4,5,6]

          x = [1, 2, 3]
          y = [4, 5, 6]
          z = x + y         # z is [1,2,3,4,5,6]; x is unchanged.

          x, y = [1, 2]     # x is 1 and y is 2
          [x, y] = 1, 2     # same as above
          x, y = [1, 2]     # same as above
          x, y = 1, 2       # same as above
          _, y = [1, 2]     # y is 2, didn't care about the first element
  47. Looping over a list by index:
          >>> a = ['Mary', 'had', 'a', 'little', 'lamb']
          >>> for i in range(len(a)):
          ...     print(i, a[i])
          ...
          0 Mary
          1 had
          2 a
          3 little
          4 lamb
  48. What is the expected output of the following code?
          a = list(range(10))
          b = a            # b is another name for the same list
          b[0] = 100
          print(a)         # [100, 1, 2, 3, 4, 5, 6, 7, 8, 9]

          a = list(range(10))
          b = a[:]         # b is a copy of the list
          b[0] = 100
          print(a)         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

          a = [0, 1, 2, 3, 4]
          b = a
          c = a[:]
          a == b           # True
          a is b           # True
          a == c           # True
          a is c           # False
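The slice copy `a[:]` above is a shallow copy. The sketch below, which goes slightly beyond the slide, shows what that means for nested lists and uses the standard-library `copy.deepcopy` for a fully independent copy. The names `alias`, `shallow`, and `deep` are illustrative.

```python
import copy

a = [[1, 2], [3, 4]]
alias = a                   # same list object as a
shallow = a[:]              # new outer list, but the inner lists are shared
deep = copy.deepcopy(a)     # new outer list and new inner lists

a[0][0] = 100               # modify an inner list through a
a.append([5, 6])            # modify the outer list through a

print(alias)     # [[100, 2], [3, 4], [5, 6]]
print(shallow)   # [[100, 2], [3, 4]]
print(deep)      # [[1, 2], [3, 4]]
```

The shallow copy sees the change to the shared inner list but not the appended element; the deep copy sees neither.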
  49. Tuples are similar to lists, but are immutable.
          a_tuple = (0, 1, 2, 3, 4)
          Other_tuple = 3, 4
          Another_tuple = tuple([0, 1, 2, 3, 4])
          Heterogeneous_tuple = ('john', 1.1, [1, 2])
      They can be sliced, concatenated, or repeated:
          a_tuple[2:4]      # evaluates to (2, 3)
      They cannot be modified:
          a_tuple[2] = 5    # TypeError: 'tuple' object does not support item assignment
      Note: a tuple is defined by the comma, not the parentheses, which are only used for convenience and for grouping elements. So a = (1) is not a tuple, but a = (1,) is.
  50. Tuples are useful for returning multiple values from functions. Tuples and lists can also be used for multiple assignment.
          def sum_and_product(x, y):
              return (x + y), (x * y)

          sp = sum_and_product(2, 3)       # equals (5, 6)
          s, p = sum_and_product(5, 10)    # s is 15, p is 50

          x, y = 1, 2
          [x, y] = [1, 2]
          (x, y) = (1, 2)
          x, y = y, x                      # swaps x and y
  51. Which of these assignments raise an error?
          a = [1, 2, 3, 4, 5, 6]
          my_tuple = (a,)
          my_tuple[0] = a       # ERROR: tuples do not support item assignment

          a = [1, 2, 3, 4, 5, 6]
          my_tuple = (a)        # not a tuple: just another name for the list a
          my_tuple[0] = a       # no error

          a = [1, 2, 3, 4, 5, 6]
          my_tuple = (a,)
          my_tuple[0] = 5       # ERROR

          a = [1, 2, 3, 4, 5, 6]
          my_tuple = (a,)
          my_tuple[0][0] = 5    # no error: the list inside the tuple is mutable
  52. A dictionary associates values with unique keys.
          empty_dict = {}                          # Pythonic
          empty_dict2 = dict()                     # less Pythonic
          grades = { "Joel" : 80, "Tim" : 95 }     # dictionary literal
      Access or modify a value with its key:
          joels_grade = grades["Joel"]             # equals 80
          grades["Tim"] = 99                       # replaces the old value
          grades["Kate"] = 100                     # adds a third entry
          num_students = len(grades)               # equals 3
      Looking up a missing key raises a KeyError:
          try:
              kates_grade = grades["Kate"]
          except KeyError:
              print("no grade for Kate!")
  54. Check for the existence of a key (grades is assumed here to be {"Joel": 80, "Tim": 95}, i.e. before "Kate" was added):
          joel_has_grade = "Joel" in grades      # True
          kate_has_grade = "Kate" in grades      # False
      Use get to avoid a KeyError and supply a default value:
          joels_grade = grades.get("Joel", 0)    # equals 80
          kates_grade = grades.get("Kate", 0)    # equals 0
          no_ones_grade = grades.get("No One")   # default default is None
      Get all items. In Python 2 the following return lists; in Python 3 they return iterable view objects instead:
          all_keys = grades.keys()               # all keys
          all_values = grades.values()           # all values
          all_pairs = grades.items()             # (key, value) tuples
      Which of the following is faster?
          'Joel' in grades       # faster: hash table lookup
          'Joel' in all_keys     # slower when all_keys is a list (Python 2)
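A short Python 3 sketch of the points above: `get` with a default, and the fact that `keys()` returns a view that tracks later changes to the dictionary, whereas `list(...)` takes a snapshot. The variable names are illustrative.

```python
grades = {"Joel": 80, "Tim": 95}

kates_grade = grades.get("Kate", 0)     # 0: the key is missing, so the default is used
no_ones_grade = grades.get("No One")    # None: the default default

keys_view = grades.keys()               # a view object, not a list
keys_snapshot = list(grades.keys())     # an actual list, fixed at this moment

grades["Kate"] = 100                    # add an entry afterwards

print("Kate" in keys_view)              # True: the view reflects the new key
print("Kate" in keys_snapshot)          # False: the list was copied earlier
print(sorted(grades.items()))           # [('Joel', 80), ('Kate', 100), ('Tim', 95)]
```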
  55. any and all:
          a = [0, 0, 0, 1]
          any(a)    # True: at least one element is truthy
          all(a)    # False: not every element is truthy
  56. Handling exceptions:
          try:
              print(0 / 0)
          except ZeroDivisionError:
              print("cannot divide by zero")
      https://docs.python.org/3/tutorial/errors.html
  57. Functions are defined using def:
          def double(x):
              """this is where you put an optional docstring
              that explains what the function does.
              for example, this function multiplies its input by 2"""
              return x * 2
      • You can call a function after it is defined:
          z = double(10)       # z is 20
      • You can give default values to parameters:
          def my_print(message="my default message"):
              print(message)
          my_print("hello")    # prints 'hello'
          my_print()           # prints 'my default message'
  58. Sometimes it is useful to specify arguments by name:
          def subtract(a=0, b=0):
              return a - b
          subtract(10, 5)          # returns 5
          subtract(0, 5)           # returns -5
          subtract(b = 5)          # same as above
          subtract(b = 5, a = 0)   # same as above
  59. Functions are objects too:
          def double(x):
              return x * 2
          DD = double
          DD(2)                    # 4

          def apply_to_one(f):
              return f(1)
          x = apply_to_one(DD)
          x                        # 2
  60. Small anonymous functions can be created with the lambda keyword. The power of lambda is better shown when you use it as an anonymous function inside another function:
          def myfunc(n):
              return lambda a : a * n
          mydoubler = myfunc(2)
          mytripler = myfunc(3)
          print(mydoubler(11))     # 22
          print(mytripler(11))     # 33
      A lambda function can take any number of arguments, but can only have one expression:
          x = lambda a : a + 10
          print(x(5))              # 15
          x = lambda a, b, c : a * b - c
          print(x(5, 6, 2))        # 28
  61. Sorting with a key function:
          pairs = [(2, 'two'), (3, 'three'), (1, 'one'), (4, 'four')]
          pairs.sort(key = lambda pair: pair[0])
          print(pairs)     # [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
      The same with a named function:
          def getKey(pair):
              return pair[0]
          pairs.sort(key=getKey)
          print(pairs)     # [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
  62. A list comprehension is a very convenient way to create a new list:
          squares = [x * x for x in range(5)]
          print(squares)   # [0, 1, 4, 9, 16]
      The equivalent loop:
          squares = [0, 0, 0, 0, 0]
          for x in range(5):
              squares[x] = x * x
          print(squares)   # [0, 1, 4, 9, 16]
  63. List comprehensions can also be used to filter a list. With a loop:
          even_numbers = []
          for x in range(5):
              if x % 2 == 0:
                  even_numbers.append(x)
          even_numbers     # [0, 2, 4]
      With a comprehension:
          even_numbers = [x for x in range(5) if x % 2 == 0]
          even_numbers     # [0, 2, 4]
  64. More complex examples:
          # create 100 pairs (0,0) (0,1) ... (9,8), (9,9)
          pairs = [(x, y)
                   for x in range(10)
                   for y in range(10)]

          # only pairs with x < y;
          # range(lo, hi) yields lo, lo + 1, ..., hi - 1
          increasing_pairs = [(x, y)
                              for x in range(10)
                              for y in range(x + 1, 10)]
          # [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9), (1, 2), (1, 3), ...]
  65. map is a convenient tool in Python for applying a function to sequences of data:
          def double(x):
              return 2 * x
          b = range(5)
          list(map(double, b))     # [0, 2, 4, 6, 8]
          double(b)                # TypeError: unsupported operand type(s) for *: 'int' and 'range'
      The same result with a list comprehension:
          print([double(i) for i in range(5)])    # [0, 2, 4, 6, 8]
  66. map with lambda:
          map_output = map(lambda x: x * 2, [1, 2, 3, 4])
          print(map_output)                   # a map object, e.g. <map object at 0x04D6BAB0>
          list_map_output = list(map_output)
          print(list_map_output)              # [2, 4, 6, 8]
          # In Python 2, map returned the list directly; in Python 3, wrap it in list():
          list(map(lambda x: x * 2, [1, 2, 3, 4]))         # [2, 4, 6, 8]
          # map also accepts several sequences; list_a and list_b are assumed example lists:
          list_a = [1, 2, 3]
          list_b = [10, 20, 30]
          list(map(lambda x, y: x + y, list_a, list_b))    # [11, 22, 33]
  67. filter keeps the elements for which a function returns True:
          def is_even(x):
              return x % 2 == 0
          a = [0, 1, 2, 3]
          list(filter(is_even, a))            # [0, 2]
          [x for x in a if is_even(x)]        # [0, 2]

          a = [1, 2, 3, 4, 5, 6]
          print(list(filter(lambda x : x % 2 == 0, a)))    # [2, 4, 6]
  68. reduce combines the elements of a sequence into a single value:
          from functools import reduce
          reduce(lambda x, y: x + y, range(10))       # 45
          reduce(lambda x, y: x * y, [1, 2, 3, 4])    # 24
  69. zip is useful to combine multiple lists into a list of tuples:
          list(zip(['a', 'b', 'c'], [1, 2, 3], ['A', 'B', 'C']))
          # [('a', 1, 'A'), ('b', 2, 'B'), ('c', 3, 'C')]

          names = ['James', 'Tom', 'Mary']
          grades = [100, 90, 95]
          list(zip(names, grades))
          # [('James', 100), ('Tom', 90), ('Mary', 95)]
  70. Opening a file:
          file object = open(file_name [, access_mode])
      access_mode determines the mode in which the file is opened, i.e., read, write, append, etc. This parameter is optional, and the default file access mode is read (r).
  72. read(): reads the entire file and returns its contents as a string.
      readline(): reads one line of the file, i.e. up to a newline character (or up to EOF if the file has a single line), and returns it as a string; on a freshly opened file this is the first line.
      readlines(): reads the entire file line by line and returns a list of line strings.
      Sample file contents shown on the slide: 1 hello 40 50 hi This is my course Welcome to this course n wish you all the best
      Writing to a file:
          f = open("my_file2.txt", 'w')
          f.write("Hello Everyone!")
          f.close()    # close the file so the text is written to disk
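The three reading methods can be compared on a small file. The sketch below writes its own two-line file into a temporary directory (an assumption made so the example does not depend on any existing file) and uses the `with` statement, which closes the file automatically.

```python
import os
import tempfile

# Hypothetical location: a fresh temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "my_file2.txt")

with open(path, "w") as f:            # 'w' creates or truncates the file
    f.write("Hello Everyone!\n")
    f.write("Welcome to this course\n")

with open(path) as f:                 # default mode is 'r'
    whole_text = f.read()             # the entire file as one string

with open(path) as f:
    first_line = f.readline()         # only the first line, including '\n'

with open(path) as f:
    all_lines = f.readlines()         # a list with one string per line

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```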
  73. Notice how each piece of data is separated by a comma. 73| @Apptrainers
  76. NumPy: Numerical Computing in Python
  77. What is NumPy?
      • NumPy, SciPy, and Matplotlib provide MATLAB-like functionality in Python.
      • NumPy features:
          - Typed multidimensional arrays (matrices)
          - Fast numerical computations (matrix math)
          - High-level math functions
  78. Why do we need NumPy Let’s see for ourselves! 4 |@Apptrainers
  79. Why do we need NumPy • Python does numerical computations slowly. • 1000 x 1000 matrix multiply  Python triple loop takes > 10 min.  Numpy takes ~0.03 seconds 5 |@Apptrainers
  80. NumPy Overview 1. Arrays 2. Shaping and transposition 3. Mathematical Operations 4. Indexing and slicing 5. Broadcasting 6 |@Apptrainers
  81. Arrays Structured lists of numbers. • Vectors • Matrices • Images • Tensors • ConvNets 7 |@Apptrainers
  82. Arrays: structured lists of numbers. • Vectors, e.g. $(p_x, p_y, p_z)^{T}$ • Matrices, e.g. $\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}$ • Images • Tensors • ConvNets
  86. Arrays, basic properties
          import numpy as np
          a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
          print(a.ndim, a.shape, a.dtype)    # 2 (2, 3) float32
      1. Arrays can have any number of dimensions, including zero (a scalar).
      2. Arrays are typed: np.uint8, np.int64, np.float32, np.float64
      3. Arrays are dense. Each element of the array exists and has the same type.
  87. Arrays, creation • np.ones, np.zeros • np.arange • np.concatenate • np.astype • np.zeros_like, np.ones_like • np.random.random 13 |@Apptrainers
  96. Arrays, danger zone • Must be dense, no holes. • Must be one type • Cannot combine arrays of different shape 22 |@Apptrainers
  97. Shaping
          a = np.array([1, 2, 3, 4, 5, 6])
          a = a.reshape(3, 2)
          a = a.reshape(2, -1)
          a = a.ravel()
      1. The total number of elements cannot change.
      2. Use -1 to infer an axis length.
      3. Row-major by default (MATLAB is column-major).
  98. Shaping example:
          import numpy as np
          a = np.array([1, 2, 3, 4, 5, 6])
          print(a)
          print('-' * 20)
          b = a.reshape(3, 2)
          print(b)
          print('-' * 20)
          c = a.reshape(2, -1)
          print(c)
          print('-' * 20)
          d = a.ravel()
          print(d)
  100. Return values
      • NumPy functions return either views or copies.
      • Views share data with the original array, like references in Java/C++. Altering entries of a view changes the same entries in the original.
      • The NumPy documentation says which functions return views and which return copies.
      • a.copy() (or np.copy(a)) makes an explicit copy; a.view() makes an explicit view.
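A minimal sketch of the view/copy distinction, assuming NumPy is installed. Basic slicing returns a view, `.copy()` returns an independent array, and `np.shares_memory` reports whether two arrays overlap in memory. The variable names are illustrative.

```python
import numpy as np

a = np.arange(6)              # [0 1 2 3 4 5]
view = a[1:4]                 # basic slicing returns a view onto a
copied = a[1:4].copy()        # an explicit, independent copy: [1 2 3]

view[0] = 100                 # writes through to a[1]
copied[1] = -1                # changes only the copy

print(a)                              # [  0 100   2   3   4   5]
print(copied)                         # [ 1 -1  3]
print(np.shares_memory(a, view))      # True
print(np.shares_memory(a, copied))    # False
```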
  101. Transposition
          a = np.arange(10).reshape(5, 2)
          a = a.T
          a = a.transpose((1, 0))
      np.transpose permutes axes in the order you give. a.T reverses the order of all axes; for a 2-D array this swaps rows and columns.
  104. Saving and loading arrays
          np.savez('data.npz', a=a)
          data = np.load('data.npz')
          a = data['a']
      1. NPZ files can hold multiple arrays.
      2. np.savez_compressed works the same way but compresses the file.
  105. Mathematical operators
      • Arithmetic operations are element-wise.
      • Comparison and logical operators return a bool array.
      • In-place operations modify the array.
  109. Math, upcasting
      Just as in Python and Java, the result of a math operator is cast to the more general or precise datatype.
          uint64 + uint16 => uint64
          float32 / int32 => float64
      Warning: upcasting does not prevent overflow/underflow. You must manually cast first.
      Use case: images are often stored as uint8. You should convert to float32 or float64 before doing math.
  110. Math, universal functions
      Also called ufuncs. They operate element-wise.
      Examples:
      • np.exp
      • np.sqrt
      • np.sin
      • np.cos
      • np.isnan
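The overflow warning can be seen directly with a small uint8 array. This is a sketch assuming NumPy is installed; the pixel values are made up for illustration.

```python
import numpy as np

pixels = np.array([200, 250, 10], dtype=np.uint8)

# Both operands are uint8, so the result stays uint8 and wraps modulo 256.
wrapped = pixels + pixels

# Casting first gives the arithmetic enough range.
as_float = pixels.astype(np.float32)
safe = as_float + as_float

print(wrapped, wrapped.dtype)    # [144 244  20] uint8
print(safe, safe.dtype)          # [400. 500.  20.] float32

# np.result_type reports the dtype that mixed operands are promoted to.
print(np.result_type(np.uint64, np.uint16))    # uint64
print(np.result_type(np.float32, np.int32))    # float64
```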
  113. Indexing
          x[0,0]     # top-left element
          x[0,-1]    # first row, last column
          x[0,:]     # first row (many entries)
          x[:,0]     # first column (many entries)
      Notes:
      • Zero-indexing
      • Multi-dimensional indices are comma-separated (i.e., a tuple)
  115. Python slicing
      Syntax: start:stop:step
          a = list(range(10))
          a[:3]       # indices 0, 1, 2
          a[-3:]      # indices 7, 8, 9
          a[3:8:2]    # indices 3, 5, 7
          a[4:1:-1]   # indices 4, 3, 2 (this one is tricky)
  117. Axes
          a.sum()          # sum all entries
          a.sum(axis=0)    # sum over rows
          a.sum(axis=1)    # sum over columns
          a.sum(axis=1, keepdims=True)
      1. Use the axis parameter to control which axis NumPy operates on.
      2. Typically, the axis specified will disappear; keepdims keeps all dimensions.
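A small worked example of the axis parameter on a 2 x 3 array, assuming NumPy is installed. The array contents are chosen only so the sums are easy to check by hand.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # [[0 1 2]
                                  #  [3 4 5]]

total = a.sum()                   # 15: every entry
down_rows = a.sum(axis=0)         # [3 5 7]: axis 0 (rows) disappears, one sum per column
across_cols = a.sum(axis=1)       # [ 3 12]: axis 1 (columns) disappears, one sum per row
kept = a.sum(axis=1, keepdims=True)   # [[ 3] [12]]: shape (2, 1) instead of (2,)

print(total, down_rows, across_cols)
print(across_cols.shape, kept.shape)  # (2,) (2, 1)
```

Keeping the summed axis as size 1 is convenient when the result is later broadcast against the original array, for example `a / kept` to normalize each row.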
  119. Broadcasting
          a = a + 1    # add one to every element
      When operating on multiple arrays, broadcasting rules are used. Each dimension must match, from right to left:
      1. Dimensions of size 1 will broadcast (as if the value was repeated).
      2. Otherwise, the dimensions must have the same size.
      3. Extra dimensions of size 1 are added to the left as needed.
  120. Broadcasting example
      Suppose we want to add a color value to an image:
          a.shape is 100, 200, 3
          b.shape is 3
      a + b will pad b with two extra dimensions so it has an effective shape of 1 x 1 x 3. So, the addition will broadcast over the first and second dimensions.
  121. Broadcasting failures
      If a.shape is 100, 200, 3 but b.shape is 4, then a + b will fail. The trailing dimensions must have the same size (or be 1).
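The color example and the failure case can both be reproduced in a few lines, assuming NumPy is installed. The names `image`, `color`, and `bad` and the fill values are made up for this sketch.

```python
import numpy as np

image = np.zeros((100, 200, 3))        # shape 100 x 200 x 3
color = np.array([1.0, 2.0, 3.0])      # shape 3, treated as 1 x 1 x 3

shifted = image + color                # broadcasts over the first two dimensions
print(shifted.shape)                   # (100, 200, 3)
print(shifted[0, 0])                   # [1. 2. 3.]

bad = np.ones(4)                       # trailing dimension 4 does not match 3
try:
    image + bad
    failed = False
except ValueError as err:              # NumPy raises ValueError for incompatible shapes
    failed = True
    print("broadcast failed:", err)
```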
  122. Tips to avoid bugs 1. Know what your datatypes are. 2. Check whether you have a view or a copy. 3. Know np.dot vs np.multiply. 48 |@Apptrainers
  123. numpy.dot
          numpy.dot(a, b, out=None)
      Dot product of two arrays. Specifically,
      • If both a and b are 1-D arrays, it is the inner product of vectors (without complex conjugation).
      • If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred.
      • If either a or b is 0-D (scalar), it is equivalent to multiply, and using numpy.multiply(a, b) or a * b is preferred.
      • If a is an N-D array and b is a 1-D array, it is a sum product over the last axis of a and b.
      • If a is an N-D array and b is an M-D array (where M>=2), it is a sum product over the last axis of a and the second-to-last axis of b:
          dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
  125. numpy.multiply
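To make the np.dot versus np.multiply tip concrete, the sketch below applies both to the same pair of 2 x 2 matrices, assuming NumPy is installed. The matrices are arbitrary small examples chosen so the results can be checked by hand.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = np.multiply(a, b)    # same as a * b: multiplies matching entries
matrix_product = np.dot(a, b)      # same as a @ b for 2-D arrays: rows times columns

print(elementwise)       # [[ 5 12]
                         #  [21 32]]
print(matrix_product)    # [[19 22]
                         #  [43 50]]
```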
  128. What is Pandas? Pandas is a Python module that rounds out the capabilities of NumPy, SciPy, and Matplotlib. The word pandas is an acronym derived from "Python and data analysis" and "panel data". There is often some confusion about whether Pandas is an alternative to NumPy, SciPy, and Matplotlib. The truth is that it is built on top of NumPy, which means that NumPy is required by pandas. SciPy and Matplotlib, on the other hand, are not required by pandas, but they are extremely useful. That is why the Pandas project lists them as "optional dependencies".
  129. What is Pandas? • Pandas is a software library written for the Python programming language. • It is used for data manipulation and analysis. • It provides special data structures and operations for the manipulation of numerical tables and time series. | @Apptrainers
  130. Common Data Structures in Pandas • Series • DataFrame | @Apptrainers
  131. Series • A Series is a one-dimensional labelled array-like object. • It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. • It can be seen as a data structure with two arrays: one functioning as the index, i.e. the labels, and the other one contains the actual data. | @Apptrainers
  132. Example import pandas as pd S = pd.Series([11, 28, 72, 3, 5, 8]) S The above code returns: 0 11 1 28 2 72 3 3 4 5 5 8 dtype: int64 | @Apptrainers
  133. • We can directly access the index and the values of our Series S: print(S.index) print(S.values) RangeIndex(start=0, stop=6, step=1) [11 28 72 3 5 8] | @Apptrainers
  134. • If we compare this to creating an array in numpy, there are still lots of similarities: import numpy as np X = np.array([11, 28, 72, 3, 5, 8]) print(X) print(S.values) # both are the same type: print(type(S.values), type(X)) [11 28 72 3 5 8] [11 28 72 3 5 8] <class 'numpy.ndarray'> <class 'numpy.ndarray'> | @Apptrainers
  135. Another example: fruits = ['apples', 'oranges', 'cherries', 'pears'] quantities = [20, 33, 52, 10] S = pd.Series(quantities, index=fruits) S Output: apples 20 oranges 33 cherries 52 pears 10 dtype: int64 | @Apptrainers
  136. If we add two series with the same indices, we get a new series with the same index and the corresponding values will be added: fruits = ['apples', 'oranges', 'cherries', 'pears'] S = pd.Series([20, 33, 52, 10], index=fruits) S2 = pd.Series([17, 13, 31, 32], index=fruits) print(S + S2) print("sum of S: ", sum(S)) Output: apples 37 oranges 46 cherries 83 pears 42 dtype: int64 sum of S: 115 | @Apptrainers
  137. The indices do not have to be the same for the Series addition. The index of the result will be the "union" of both indices. If a label occurs in only one of the two Series, the result for that label will be NaN: fruits = ['peaches', 'oranges', 'cherries', 'pears'] fruits2 = ['raspberries', 'oranges', 'cherries', 'pears'] S = pd.Series([20, 33, 52, 10], index=fruits) S2 = pd.Series([17, 13, 31, 32], index=fruits2) print(S + S2) Output: cherries 83.0 oranges 46.0 peaches NaN pears 42.0 raspberries NaN dtype: float64 | @Apptrainers
  138. fruits = ['apples', 'oranges', 'cherries', 'pears'] fruits_ro = ["mere", "portocale", "cireșe", "pere"] S = pd.Series([20, 33, 52, 10], index=fruits) S2 = pd.Series([17, 13, 31, 32], index=fruits_ro) print(S+S2) Output: apples NaN cherries NaN cireșe NaN mere NaN oranges NaN pears NaN pere NaN portocale NaN dtype: float64 | @Apptrainers
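When NaN for unmatched labels is not wanted, one option (not shown in the slides) is the Series method add with its fill_value parameter, which treats a label missing from one side as the given value. A small sketch reusing the fruit data from above:

```python
import pandas as pd

S = pd.Series([20, 33, 52, 10],
              index=['peaches', 'oranges', 'cherries', 'pears'])
S2 = pd.Series([17, 13, 31, 32],
               index=['raspberries', 'oranges', 'cherries', 'pears'])

# Plain addition: labels present in only one Series become NaN.
with_nan = S + S2

# add() with fill_value=0: a label missing on one side counts as 0.
total = S.add(S2, fill_value=0)

print(with_nan)
print(total)
```

Whether 0 is a sensible substitute depends on the data; for stock quantities it reads as "none in that shop".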
  139. It's possible to access single values of a Series or more than one value by a list of indices: print(S['apples']) 20 print(S[['apples', 'oranges', 'cherries']]) apples 20 oranges 33 cherries 52 dtype: int64 | @Apptrainers
  140. Similar to Numpy we can use scalar operations or mathematical functions on a series: import numpy as np print((S + 3) * 4) print("======================") print(np.sin(S)) Output: apples 92 oranges 144 cherries 220 pears 52 dtype: int64 ====================== apples 0.912945 oranges 0.999912 cherries 0.986628 pears -0.544021 dtype: float64 | @Apptrainers
  141. pandas.Series.apply Series.apply(func, convert_dtype=True, args=(), **kwds) Parameters: • func: a function, which can be a NumPy function that will be applied to the entire Series or a Python function that will be applied to every single value of the Series. • convert_dtype: a boolean value. If it is set to True (default), apply will try to find a better dtype for elementwise function results. If False, the result is left as dtype=object. (This parameter is deprecated in recent pandas releases.) • args: positional arguments which will be passed to the function "func" in addition to the values from the Series. • **kwds: additional keyword arguments which will be passed as keywords to the function. | @Apptrainers
  142. S.apply(np.sin) apples 0.912945 oranges 0.999912 cherries 0.986628 pears -0.544021 dtype: float64 | @Apptrainers
  143. • We can also use Python lambda functions. Let's assume we have the following task: check the amount of fruit for every kind. If there are fewer than 50 available, we will augment the stock by 10: S.apply(lambda x: x if x >= 50 else x + 10) apples 30 oranges 43 cherries 52 pears 20 dtype: int64 | @Apptrainers
  144. Filtering with a Boolean array: S[S>30] oranges 33 cherries 52 dtype: int64 | @Apptrainers
  145. • A series can be seen as an ordered Python dictionary with a fixed length. "apples" in S True | @Apptrainers
  146. • We can even pass a dictionary to a Series object when we create it. We get a Series with the dict's keys as the index. In current pandas versions the index keeps the dictionary's insertion order; it is not sorted automatically (use sort_index() for that). cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235, "Rome": 2874038, "Paris": 2273305, "Vienna": 1805681, "Bucharest":1803425, "Hamburg": 1760433, "Budapest": 1754000, "Warsaw": 1740119, "Barcelona":1602386, "Munich": 1493900, "Milan": 1350680} city_series = pd.Series(cities) print(city_series)
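A brief sketch of how the index order behaves when a Series is built from a dictionary, assuming a reasonably recent pandas and Python 3.7 or later. Only three of the cities are used to keep it short.

```python
import pandas as pd

cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235}

# The index follows the insertion order of the dictionary keys.
city_series = pd.Series(cities)
print(list(city_series.index))

# An alphabetically sorted index has to be requested explicitly.
sorted_series = city_series.sort_index()
print(list(sorted_series.index))
```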
  147. | @Apptrainers
  148. NaN One problem in dealing with data analysis tasks consists in missing data. Pandas makes it as easy as possible to work with missing data. my_cities = ["London", "Paris", "Zurich", "Berlin", "Stuttgart", "Hamburg"] my_city_series = pd.Series(cities, index=my_cities) my_city_series
  149. | @Apptrainers
  150. • Due to the NaN values the population values for the other cities are turned into floats. There is no missing data in the following examples, so the values are int: my_cities = ["London", "Paris", "Berlin", "Hamburg"] my_city_series = pd.Series(cities, index=my_cities) my_city_series
  151. The Methods isnull() and notnull() my_cities = ["London", "Paris", "Zurich", "Berlin", "Stuttgart", "Hamburg"] my_city_series = pd.Series(cities, index=my_cities) print(my_city_series.isnull()) | @Apptrainers
  152. print(my_city_series.notnull())
  153. • We also get a NaN if a value in the dictionary is None: d = {"a":23, "b":45, "c":None, "d":0} S = pd.Series(d) print(S) | @Apptrainers
  154. print(pd.isnull(S)) | @Apptrainers
  155. print(pd.notnull(S)) | @Apptrainers
  156. Filtering out Missing Data It's possible to filter out missing data with the Series method dropna. It returns a Series which consists only of non-null data: import pandas as pd cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235, "Rome": 2874038, "Paris": 2273305, "Vienna": 1805681, "Bucharest":1803425, "Hamburg": 1760433, "Budapest": 1754000, "Warsaw": 1740119, "Barcelona":1602386, "Munich": 1493900, "Milan": 1350680} my_cities = ["London", "Paris", "Zurich", "Berlin", "Stuttgart", "Hamburg"] my_city_series = pd.Series(cities, index=my_cities) print(my_city_series.dropna()) | @Apptrainers
  157. | @Apptrainers
  158. Filling in Missing Data • In many cases you don't want to filter out missing data, but you want to fill in appropriate data for the empty gaps. A suitable method in many situations will be fillna: print(my_city_series.fillna(0)) London 8615246.0 Paris 2273305.0 Zurich 0.0 Berlin 3562166.0 Stuttgart 0.0 Hamburg 1760433.0 dtype: float64 | @Apptrainers
  159. • If we call fillna with a dictionary, we can provide the appropriate data, i.e. the population of Zurich and Stuttgart: missing_cities = {"Stuttgart":597939, "Zurich":378884} my_city_series.fillna(missing_cities) London 8615246.0 Paris 2273305.0 Zurich 378884.0 Berlin 3562166.0 Stuttgart 597939.0 Hamburg 1760433.0 dtype: float64 | @Apptrainers
  160. cities = {"London": 8615246, "Berlin": 3562166, "Madrid": 3165235, "Rome": 2874038, "Paris": 2273305, "Vienna": 1805681, "Bucharest":1803425, "Hamburg": 1760433, "Budapest": 1754000, "Warsaw": 1740119, "Barcelona":1602386, "Munich": 1493900, "Milan": 1350680} my_cities = ["London", "Paris", "Zurich", "Berlin", "Stuttgart", "Hamburg"] my_city_series = pd.Series(cities, index=my_cities) my_city_series = my_city_series.fillna(0).astype(int) print(my_city_series) | @Apptrainers
  161. London 8615246 Paris 2273305 Zurich 0 Berlin 3562166 Stuttgart 0 Hamburg 1760433 dtype: int64 | @Apptrainers
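Filling with 0 and casting to int, as above, replaces the missing populations with a real-looking number. As an alternative that is not part of the slides, pandas also offers the nullable integer dtype "Int64", which keeps integer values while still marking gaps as missing. A minimal sketch with a shortened cities dictionary:

```python
import pandas as pd

cities = {"London": 8615246, "Berlin": 3562166,
          "Paris": 2273305, "Hamburg": 1760433}
my_cities = ["London", "Paris", "Zurich", "Berlin", "Stuttgart", "Hamburg"]

# Missing cities become NaN, which forces the float64 dtype.
as_float = pd.Series(cities, index=my_cities)

# The nullable "Int64" dtype stores integers and marks gaps as <NA>.
as_nullable_int = as_float.astype("Int64")

print(as_float.dtype, as_nullable_int.dtype)
print(as_nullable_int)
```

With this dtype, Zurich and Stuttgart stay visibly missing instead of appearing to have a population of 0.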
  162. DataFrame • The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. • A DataFrame logically corresponds to a "sheet" of an Excel document. • A DataFrame has both a row and a column index. | @Apptrainers
  163. • Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. • Each column holds a single data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of Boolean values and so on. • There is a close connection between the DataFrames and the Series of Pandas. • A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame. | @Apptrainers
  164. import pandas as pd years = range(2014, 2018) shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years) shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years) shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years) print(pd.concat([shop1, shop2, shop3])) | @Apptrainers
  165. | @Apptrainers
  166. • This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1": shops_df = pd.concat([shop1, shop2, shop3], axis=1) print(shops_df) | @Apptrainers
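The effect of the axis parameter is easiest to see from the shapes of the results. The sketch below reuses the shop Series from the slides; the column labels passed via keys are hypothetical names chosen for illustration.

```python
import pandas as pd

years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)

# axis=0 (default): the Series are stacked into one long Series.
stacked = pd.concat([shop1, shop2, shop3])

# axis=1: each Series becomes a column of a DataFrame.
side_by_side = pd.concat([shop1, shop2, shop3], axis=1)

# keys= supplies column labels at concatenation time
# (the labels here are made up for this example).
named = pd.concat([shop1, shop2, shop3], axis=1,
                  keys=["shop1", "shop2", "shop3"])

print(stacked.shape)
print(side_by_side.shape)
print(named.columns.tolist())
```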
  167. | @Apptrainers
  168. cities = ["Zürich", "Winterthur", "Freiburg"] shops_df.columns = cities print(shops_df) # alternative way: give names to series: shop1.name = "Zürich" shop2.name = "Winterthur" shop3.name = "Freiburg" print("------") shops_df2 = pd.concat([shop1, shop2, shop3], axis=1) print(shops_df2) | @Apptrainers
  169. | @Apptrainers
  170. print(type(shops_df)) <class 'pandas.core.frame.DataFrame'> | @Apptrainers
  171. DataFrames from Dictionaries cities = {"name": ["London", "Berlin", "Madrid", "Rome", "Paris", "Vienna", "Bucharest", "Hamburg", "Budapest", "Warsaw", "Barcelona", "Munich", "Milan"], "population": [8615246, 3562166, 3165235, 2874038, 2273305, 1805681, 1803425, 1760433, 1754000, 1740119, 1602386, 1493900, 1350680], "country": ["England", "Germany", "Spain", "Italy", "France", "Austria", "Romania", "Germany", "Hungary", "Poland", "Spain", "Germany", "Italy"]} city_frame = pd.DataFrame(cities) print(city_frame) | @Apptrainers
  172. | @Apptrainers
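  173. Retrieving the Column Names city_frame.columns.values Output (current pandas versions keep the key order of the dictionary; older versions sorted the column names): array(['name', 'population', 'country'], dtype=object) | @Apptrainers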
  174. Custom Index • We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a custom index to the DataFrame object: ordinals = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth"] city_frame = pd.DataFrame(cities, index=ordinals) print(city_frame) | @Apptrainers
  175. | @Apptrainers
  176. Rearranging the Order of Columns We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This also guarantees a defined ordering of our columns when we create the DataFrame from a dictionary. Since Python 3.7 dictionaries preserve insertion order, but an explicit columns list makes the intended order clear and independent of how the dictionary was built. | @Apptrainers
  177. city_frame = pd.DataFrame(cities, columns=["name", "country", "population"]) print(city_frame) | @Apptrainers
  178. | @Apptrainers
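  179. • But what if you want to change the ordering of the columns of an existing DataFrame? reindex returns a new DataFrame and works on the row index by default, so we pass the columns keyword and keep the result: city_frame = city_frame.reindex(columns=["country", "name", "population"]) print(city_frame) | @Apptrainers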
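A short sketch of the pitfall with reindex, using a shortened, three-city version of the cities dictionary. Called with a plain list, reindex interprets the list as row labels, so the column order has to be requested through the columns keyword or by selecting the columns with a list.

```python
import pandas as pd

cities = {"name": ["London", "Berlin", "Madrid"],
          "population": [8615246, 3562166, 3165235],
          "country": ["England", "Germany", "Spain"]}
city_frame = pd.DataFrame(cities)

# Reorder the columns; reindex returns a new DataFrame.
reordered = city_frame.reindex(columns=["country", "name", "population"])

# Selecting with a list of column names gives the same result.
also_reordered = city_frame[["country", "name", "population"]]

# Without columns=, the list is taken as ROW labels. None of them
# exist in the index 0, 1, 2, so every cell of the result is NaN.
rows = city_frame.reindex(["country", "name", "population"])

print(reordered)
print(rows)
```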
  180. | @Apptrainers
  181. • Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions: • (index=index_mapper, columns=columns_mapper, ...) • (mapper, axis={'index', 'columns'}, ...) • We will rename the columns of our DataFrame into Romanian names in the following example. • We set the parameter inplace to True so that our DataFrame is changed directly. With inplace=False, which is the default, rename returns a new DataFrame and leaves the original unchanged. | @Apptrainers
  182. city_frame.rename(columns={"name":"Nume", "country":"țară", "population":"populație"}, inplace=True) print(city_frame) | @Apptrainers
  183. | @Apptrainers
  184. Existing Column as the Index of a DataFrame • We want to create a more useful index in the following example. We will use the country name as the index, i.e. the list value associated to the key "country" of our cities dictionary: city_frame = pd.DataFrame(cities, columns=["name", "population"], index=cities["country"]) print(city_frame) | @Apptrainers
  185. | @Apptrainers
  186. • Alternatively, we can change an existing DataFrame. • We can use the method set_index to turn a column into an index. • "set_index" does not work in-place, it returns a new data frame with the chosen column as the index: | @Apptrainers
  187. city_frame = pd.DataFrame(cities) city_frame2 = city_frame.set_index("country") print(city_frame2) | @Apptrainers
  188. | @Apptrainers
  189. • We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created: city_frame = pd.DataFrame(cities) city_frame.set_index("country", inplace=True) print(city_frame) | @Apptrainers
  190. | @Apptrainers
  191. Label-Indexing on the Rows • So far we have indexed DataFrames via the columns. We will now demonstrate how we can access rows of DataFrames via the locators 'loc' and 'iloc'. ('ix' was deprecated and has been removed in pandas 1.0.) city_frame = pd.DataFrame(cities, columns=["name", "population"], index=cities["country"]) print(city_frame.loc["Germany"]) | @Apptrainers
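To contrast the two locators: loc selects by label, iloc by integer position. The sketch below uses a shortened, four-city version of the cities dictionary, which is an assumption made to keep the output small.

```python
import pandas as pd

cities = {"name": ["London", "Berlin", "Madrid", "Hamburg"],
          "population": [8615246, 3562166, 3165235, 1760433],
          "country": ["England", "Germany", "Spain", "Germany"]}
city_frame = pd.DataFrame(cities, columns=["name", "population"],
                          index=cities["country"])

# Label-based: "Germany" occurs twice, so the result is a DataFrame.
germany = city_frame.loc["Germany"]

# A label that occurs once gives a single row as a Series.
england = city_frame.loc["England"]

# Position-based: the first row, regardless of its label.
first = city_frame.iloc[0]

print(germany)
print(first)
```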
  192. | @Apptrainers
  193. | @Apptrainers
  194. | @Apptrainers
  195. Sum and Cumulative Sum • We can calculate the sum of all the columns of a DataFrame or the sum of certain columns: print(city_frame.sum()) | @Apptrainers
  196. city_frame["population"].sum() 33800614 | @Apptrainers
  197. We can use "cumsum" to calculate the cumulative sum: x = city_frame["population"].cumsum() print(x) | @Apptrainers
  198. Assigning New Values to Columns • x is a Pandas Series. • We can reassign the previously calculated cumulative sums to the population column: city_frame["population"] = x print(city_frame) | @Apptrainers
  199. | @Apptrainers
  200. • Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new column with the name "cum_population". city_frame = pd.DataFrame(cities, columns=["country", "population", "cum_population"], index=cities["name"]) print(city_frame) | @Apptrainers
  201. | @Apptrainers
  202. • We can see that the column "cum_population" is set to NaN, as we haven't provided any data for it. • We will assign now the cumulative sums to this column: city_frame["cum_population"] =city_frame["population"].cumsum() print(city_frame) | @Apptrainers
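A compact sketch of the cumulative sum on the first three cities only (a shortened data set assumed for illustration). The last cumulative value always equals the plain sum of the column.

```python
import pandas as pd

city_frame = pd.DataFrame(
    {"population": [8615246, 3562166, 3165235]},
    index=["London", "Berlin", "Madrid"])

# cumsum returns a Series with the same index as the column.
city_frame["cum_population"] = city_frame["population"].cumsum()

total = city_frame["population"].sum()
print(city_frame)
print(total)
```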
  203. | @Apptrainers
  204. • We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN: city_frame = pd.DataFrame(cities, columns=["country", "area", "population"], index=cities["name"]) print(city_frame) | @Apptrainers
  205. | @Apptrainers
  206. Accessing the Columns of a DataFrame • There are two ways to access a column of a DataFrame. The result is in both cases a Series: # in a dictionary-like way: print(city_frame["population"]) | @Apptrainers
  207. | @Apptrainers
  208. # as an attribute print(city_frame.population) | @Apptrainers
  209. | @Apptrainers
  210. print(type(city_frame.population)) <class 'pandas.core.series.Series'> | @Apptrainers
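  211. p = city_frame.population Binding the column to a name such as "p" does not, by itself, copy the population column. In older pandas versions "p" is a view on the data of city_frame; with Copy-on-Write enabled, writing through "p" no longer changes city_frame. | @Apptrainers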
  212. Assigning New Values to a Column • The column area is still not defined. We can set all elements of the column to the same value: city_frame["area"] = 1572 print(city_frame) | @Apptrainers
  213. | @Apptrainers
  214. • In this case, it is better to assign the exact area to each city. The list with the area values needs to have the same length as the number of rows in our DataFrame. # area in square km: area = [1572, 891.85, 605.77, 1285, 105.4, 414.6, 228, 755, 525.2, 517, 101.9, 310.4, 181.8] # area could have been defined as a list, a Series, an array or a scalar city_frame["area"] = area print(city_frame) | @Apptrainers
  215. | @Apptrainers
  216. Sorting DataFrames city_frame = city_frame.sort_values(by="area", ascending=False) print(city_frame) | @Apptrainers
  217. Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well: city_frame = pd.DataFrame(cities, columns=["country", "area", "population"], index=cities["name"]) some_areas = pd.Series([1572, 755, 181.8], index=['London', 'Hamburg', 'Milan']) city_frame['area'] = some_areas print(city_frame) | @Apptrainers
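When a Series is assigned to a column, pandas matches values by index label rather than by position. The sketch below uses a shortened three-city frame (an assumption for brevity) and deliberately lists the areas in a different order than the rows.

```python
import pandas as pd

city_frame = pd.DataFrame(
    {"country": ["England", "Germany", "Germany"],
     "population": [8615246, 3562166, 1760433]},
    index=["London", "Berlin", "Hamburg"])

# The Series is in a different order and has no entry for Berlin.
some_areas = pd.Series([755, 1572], index=["Hamburg", "London"])

# Values are aligned on the index labels; Berlin gets NaN.
city_frame["area"] = some_areas
print(city_frame)
```

A plain list, by contrast, is assigned by position and must have exactly one value per row.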
  218. | @Apptrainers
  219. Inserting new columns into existing DataFrames • In the previous example we added the column area at creation time. Quite often it will be necessary to add or insert columns into existing DataFrames. • For this purpose the DataFrame class provides a method "insert", which allows us to insert a column into a DataFrame at a specified location: insert(self, loc, column, value, allow_duplicates=False) | @Apptrainers
  220. | @Apptrainers
  221. city_frame = pd.DataFrame(cities, columns=["country", "population"], index=cities["name"]) idx = 1 city_frame.insert(loc=idx, column='area', value=area) print(city_frame) <class 'pandas.core.frame.DataFrame'> | @Apptrainers
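A small sketch of insert on a two-city frame (shortened data assumed for illustration). It shows that insert modifies the DataFrame in place and, with the default allow_duplicates=False, refuses to add a column name that already exists.

```python
import pandas as pd

city_frame = pd.DataFrame(
    {"country": ["England", "Germany"],
     "population": [8615246, 3562166]},
    index=["London", "Berlin"])

# Put the new column at position 1, between country and population.
city_frame.insert(loc=1, column="area", value=[1572, 891.85])
print(city_frame.columns.tolist())

# Inserting the same column name again raises a ValueError.
try:
    city_frame.insert(loc=0, column="area", value=[0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```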
  222. | @Apptrainers
  223. | @Apptrainers
  224. DataFrame from Nested Dictionaries A nested dictionary of dictionaries can be passed to a DataFrame as well. The keys of the outer dictionary are taken as the columns, and the inner keys, i.e. the keys of the nested dictionaries, are used as the row indices: | @Apptrainers
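The code for this slide is not part of the transcript, so the following is only a sketch of the idea. The country names follow the later slides, but the years and all numbers are placeholder values invented for illustration, not real statistics.

```python
import pandas as pd

# Placeholder values only: outer keys are countries, inner keys are years.
growth = {
    "Switzerland": {2010: 1.0, 2011: 2.0},
    "Germany": {2010: 3.0, 2011: 4.0},
    "France": {2010: 5.0, 2011: 6.0},
}

# Outer keys become columns, inner keys become the row index.
growth_frame = pd.DataFrame(growth)
print(growth_frame)

# Transposing puts the countries in the rows and the years in the columns.
by_country = growth_frame.T

# Reindexing the rows with a subset of labels drops the others (here France).
without_france = by_country.reindex(["Switzerland", "Germany"])
print(without_france)
```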
  225. | @Apptrainers
  226. | @Apptrainers
  227. • You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data: growth_frame.T | @Apptrainers
  228. | @Apptrainers
  229. • Consider: growth_frame = growth_frame.T growth_frame2 = growth_frame.reindex(["Switzerland", "Italy", "Germany", "Greece"]) # remove France print(growth_frame2) | @Apptrainers
  230. | @Apptrainers
  231. Filling a DataFrame with random values: import numpy as np names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara'] index = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"] df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2), columns=names, index=index) print(df) randn: returns samples from the standard normal distribution (mean 0, variance 1). Its arguments are the dimensions of the returned array, here 12 rows and 5 columns; multiplying by 1000 scales the standard deviation to 1000. | @Apptrainers
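np.random.randn takes the output dimensions as its arguments and always draws from the standard normal distribution. A different mean or spread is obtained by shifting and scaling the samples, as in this sketch. The target mean 10 and standard deviation 2 are arbitrary values chosen for the example, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

np.random.seed(0)

# The arguments are dimensions: a 12 x 5 array of standard normal samples.
samples = np.random.randn(12, 5)
print(samples.shape)

# Shift and scale to get mean 10 and standard deviation 2.
shifted = 10 + 2 * np.random.randn(100_000)
print(shifted.mean(), shifted.std())
```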
  232. | @Apptrainers
  233. Summary • So far we have covered the following: • Python 3 (scalars, lists, dictionaries, loops, selection, functions) • Numpy • Pandas • The reason for studying these packages is to be able to program the 5 steps in any data science process. | @Apptrainers