tipuesearch_content.json
{"pages":[{"title":"About Guillaume Redoulès","text":"I am a data scientist and a mechanical engineer working on numerical methods for stress computations in the field of rocket propulsion. Prior to that, I've got a MSc in Computational Fluid Dynamics and aerodynamics from Imperial College London. Email: guillaume.redoules@gadz.org Linkedin: Guillaume Redoulès Curriculum Vitae Experience Thermomecanical method and tools engineer , Ariane Group , 2015 - Present In charge of tools and methods related to thermomecanical computations. Focal point for machine learning. Education MSc Advanced Computational Methods for Aeronautics, Flow Management and Fluid-Structure Interaction , Imperial College London, London. 2013 Dissertation: \"Estimator design for fluid flows\" Fields: Aeronautics, aerodynamics, computational fluid dynamics, numerical methods Arts et Métiers Paristech , France, 2011 Generalist engineering degree Fields: Mechanics, electrical engineering, casting, machining, project management, finance, IT, etc.","tags":"pages","url":"redoules.github.io/pages/about.html","loc":"redoules.github.io/pages/about.html"},{"title":"Upload/Download data to/from a sharepoint folder","text":"Here is how to upload/download file to/from a sharepoint folder without having to use any convoluted authentication method. import subprocess import shutil problem = False url = 'https://site.sharepoint.com/path/to/your/folder' #mount sharepoint as a network share if subprocess . call ( f 'net use S: { url } ' , shell = True ) != 0 : subprocess . call ( r 'net use S: /delete /y' , shell = True ) if subprocess . call ( f 'net use S: { url } ' , shell = True ) != 0 : problem = True if not problem : #do some stuff shutil . copy ( \"link_to_local_file\" , \"S: \\\\ \" ) else : print ( f 'Problem with accessing the Sharepoint site : { url } ' ) #unmount the network share subprocess . call ( r 'net use S: /delete /y' , shell = True )","tags":"Python","url":"redoules.github.io/python/sharepoint.html","loc":"redoules.github.io/python/sharepoint.html"},{"title":"Add autoincremeting (Primary key) id column","text":"In this very simple example we will see how to add a column named id to a table. This table with be the primary key and will be autoincremeting. ALTER TABLE `db` . `table` ADD `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY","tags":"SQL","url":"redoules.github.io/sql/PK_id_column.html","loc":"redoules.github.io/sql/PK_id_column.html"},{"title":"What's inside my .bashr_aliases ?","text":"alias maj = 'sudo apt update && sudo apt upgrade -y' alias ms = 'ls' alias nvtop = 'watch nvidia-smi' alias h = 'history|grep ' alias f = 'find . 
|grep ' alias where = 'which' alias p = 'function _p(){ ps aux | grep $1 | grep -v grep;};_p' alias wget = 'wget -c' alias calc = 'python -ic \"from __future__ import division; from math import *\"' ## pass options to free ## alias meminfo = 'free -m -l -t' ## Get cpu info ## alias cpuinfo = 'lscpu' ## get top process eating memory alias psmem = 'ps auxf | sort -nr -k 4' alias psmem10 = 'ps auxf | sort -nr -k 4 | head -10' ## get top process eating cpu ## alias pscpu = 'ps auxf | sort -nr -k 3' alias pscpu10 = 'ps auxf | sort -nr -k 3 | head -10'","tags":"Linux","url":"redoules.github.io/linux/bashr_aliases.html","loc":"redoules.github.io/linux/bashr_aliases.html"},{"title":"Clear unused docker images","text":"Run the following command to clean the unused docker images docker images -q | xargs docker rmi Before bitcoin latest 7b35f12891e7 18 minutes ago 167MB <none> <none> 4b24b9dbbd9f 26 minutes ago 167MB <none> <none> 418b6137c28e 35 minutes ago 167MB debian stable-slim e7e5f8b110eb 10 days ago 69 .2MB mempool/mempool v1.0 794703676c98 11 days ago 1 .01GB drone/drone latest f2e0470417c3 2 months ago 67 .5MB eps latest 9c7be9265c84 2 months ago 458MB <none> <none> 54c483efd98e 2 months ago 458MB <none> <none> 306b5cda6db6 2 months ago 458MB influxdb latest 15b283775653 4 months ago 311MB mlflow latest 2390c1d0c9aa 5 months ago 946MB <none> <none> 5f54797bd3bc 5 months ago 946MB <none> <none> e0d050cd4f39 5 months ago 946MB <none> <none> cb544a7cbbf4 5 months ago 946MB <none> <none> 81d3d87e6e00 5 months ago 946MB <none> <none> 76deb458df92 5 months ago 946MB <none> <none> ff5c93f1c05e 5 months ago 430MB <none> <none> 3ccee6bb3930 5 months ago 638MB <none> <none> 49104572fcb2 5 months ago 638MB <none> <none> 18875f41afe1 5 months ago 1 .22GB <none> <none> 93c114c8870f 5 months ago 1 .22GB <none> <none> 8e2959da3ce1 5 months ago 1 .22GB <none> <none> a392e8be4b98 5 months ago 1 .22GB <none> <none> 11241beda07a 5 months ago 1 .22GB <none> <none> 23e94992e3e4 5 months ago 1 .2GB <none> <none> e22cc69bbfde 5 months ago 1 .2GB <none> <none> 5d3b30cf7774 5 months ago 1 .2GB <none> <none> d055ea146c1e 5 months ago 1 .2GB <none> <none> 8163fda4e56b 5 months ago 1 .2GB <none> <none> b990dc02b0dc 5 months ago 1 .2GB <none> <none> 7f3f8a79a336 5 months ago 1 .2GB <none> <none> 4bc4af0d6774 5 months ago 1 .2GB redis latest de25a81a5a0b 5 months ago 98 .2MB gitea/gitea latest 9f07e22ee4b9 7 months ago 109MB ruimarinho/bitcoin-core <none> 9a73f94058c2 7 months ago 168MB python 3 .7.0 a187104266fb 18 months ago 922MB After REPOSITORY TAG IMAGE ID CREATED SIZE bitcoin latest 7b35f12891e7 30 minutes ago 167MB debian stable-slim e7e5f8b110eb 10 days ago 69 .2MB mempool/mempool v1.0 794703676c98 11 days ago 1 .01GB eps latest 9c7be9265c84 2 months ago 458MB influxdb latest 15b283775653 4 months ago 311MB mlflow latest 2390c1d0c9aa 5 months ago 946MB gitea/gitea latest 9f07e22ee4b9 7 months ago 109MB","tags":"Linux","url":"redoules.github.io/linux/clear_docker.html","loc":"redoules.github.io/linux/clear_docker.html"},{"title":"Add task to download station via API","text":"login and session ID login to DSM using the API method login in order to get the SID (session ID). 
In the example, the synology is hosted on 192.168.1.2 accessed via the port 5001 over https the login is myaccount and the password is pass123 https://192.168.1.2:5001/webapi/auth.cgi?api=SYNO.API.Auth&version=2&method=login&account=myaccount&passwd=pass123&session=DownloadStation&format=cookie the return value should look like this {\"data\":{\"sid\":\"7JALx67b6pHpM1920PDN547503\"},\"success\":true} make sure to note that sid value for later. Using Download Station through the API Now that we are logged in and have the SID, we can add a download task by using the API method create to download the file https://file-examples.com/wp-content/uploads/2017/02/file-sample_100kB.doc to the share folder home . Don't forget to use the sid previouly copied. https://192.168.1.2:5001/webapi/DownloadStation/task.cgi?api=SYNO.DownloadStation.Task&version=1&method=create&uri=https://file-examples.com/wp-content/uploads/2017/02/file-sample_100kB.doc&destination=home&_sid=7JALx67b6pHpM1920PDN547503 You can use curl -k in order to call this request form the command line. You can find more information in the official API reference","tags":"Linux","url":"redoules.github.io/linux/DS_API.html","loc":"redoules.github.io/linux/DS_API.html"},{"title":"Style a dataframe","text":"DataFrame output in the notebook can be personalised with some CSS syntaxe applied to the style attribute. Basic styling Basic styling, color nan values in red, max values in blue for each Line and the min value in gold for each column. # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ], 'size_cm' : [ 20 , 10 , 9 , 7 , None , 4 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' , \"size_cm\" ]) ( df . style . highlight_null ( 'red' ) . highlight_max ( color = 'steelblue' , axis = 0 ) . highlight_min ( color = 'gold' , axis = 1 ) ) fruit color kcal size_cm 0 Banana yellow 89 20 1 Orange orange 47 10 2 Apple red 52 9 3 lemon yellow 15 7 4 lime green 30 nan 5 plum purple 28 4 Gradient Color the value of the dataframe with a color gradient based on the value of the cell raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ], 'size_cm' : [ 20 , 10 , 9 , 7 , 6 , 4 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' , \"size_cm\" ]) df . style . background_gradient () fruit color kcal size_cm 0 Banana yellow 89 20 1 Orange orange 47 10 2 Apple red 52 9 3 lemon yellow 15 7 4 lime green 30 6 5 plum purple 28 4 Custom style Create a custom style using CSS def custom_style ( val ): if val < 5 : return 'background-color:red' elif val > 50 : return 'background-color:green' elif abs ( val ) < 20 : return 'background-color:yellow' else : return '' df [[ \"kcal\" , 'size_cm' ]] . style . applymap ( custom_style ) kcal size_cm 0 89 20 1 47 10 2 52 9 3 15 7 4 30 6 5 28 4 Colorbars Draw bars in the cell based on the value in the cell ( df . style . bar ( subset = [ 'kcal' , 'size_cm' ], color = 'steelblue' ) ) fruit color kcal size_cm 0 Banana yellow 89 20 1 Orange orange 47 10 2 Apple red 52 9 3 lemon yellow 15 7 4 lime green 30 6 5 plum purple 28 4 import numpy as np df [ \"random\" ] = np . random . rand ( 6 ) - 0.5 ( df . style . 
bar ( subset = [ 'kcal' , 'size_cm' ], color = 'steelblue' ) . bar ( subset = [ 'random' ], color = [ 'indianred' , 'limegreen' ], align = 'mid' ) ) fruit color kcal size_cm random 0 Banana yellow 89 20 0.432301 1 Orange orange 47 10 0.104051 2 Apple red 52 9 0.107809 3 lemon yellow 15 7 -0.281298 4 lime green 30 6 -0.122744 5 plum purple 28 4 0.452038","tags":"Python","url":"redoules.github.io/python/style_dataframe.html","loc":"redoules.github.io/python/style_dataframe.html"},{"title":"Use the pipe function for fluent pandas api","text":"pipe is a method that accepts a function pipe , by default, assumes the first argument of this function is a data frame and passes the current dataframe down the pipeline The function should return a dataframe also, if you want to continue with the chaining. Yet, it can also return any other value if you put it in the last step. This is incredibly valuable because it takes you one step further from SQL where you do things in reverse Create a sample dataframe # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ], 'size_cm' : [ 20 , 10 , 9 , 7 , 5 , 4 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' , \"size_cm\" ]) df fruit color kcal size_cm 0 Banana yellow 89 20 1 Orange orange 47 10 2 Apple red 52 9 3 lemon yellow 15 7 4 lime green 30 5 5 plum purple 28 4 def add_to_col ( de , col = 'kcal' , n = 200 ): ret = df . copy () # a dataframe is mutable, we use copy in order to avoid modifying any data ret [ col ] = ret [ col ] + n return ret ( df . pipe ( add_to_col ) . pipe ( add_to_col , col = 'size_cm' , n = 10 ) . head ( 5 ) ) fruit color kcal size_cm 0 Banana yellow 89 30 1 Orange orange 47 20 2 Apple red 52 19 3 lemon yellow 15 17 4 lime green 30 15","tags":"Python","url":"redoules.github.io/python/Use_the_pipe_function_for_fluent_pandas_api.html","loc":"redoules.github.io/python/Use_the_pipe_function_for_fluent_pandas_api.html"},{"title":"Time Series anomaly detection","text":"Time series anomaly detection \"An anomaly is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.\" (Hawking 1980) \"Anomalies [...] may or not be harmful.\" (Esling 2012) Types of anomalies The anomalies in an industrial system are often influenced by external factors such as speed or product being manufactured. There external factors represent the context and should be added to the feature vector. Furthermore, there might be a difference between what you detect and what are the people actually interested in on site. On industrial systems, you would find different types of anomalies signatures. A bearing degrading or gear wear would result in a progressive shift from the normal state. Other pattern might be detected with such anomalies : the mean is going up or the amplitude of the phenomenon is increasing or a cyclic pattern appear more often. When a component breaks or when something gets stuck the anomaly signature would result in a persitent change . This type of signature would also appear after a poorly performed maintenance. IN this case, a stepwise pattern appears in the time series data. Other anomalies can appear in the data. 
For example, a measuring error or a short current spike caused by an induction peak can appear and is considered an anomaly because it is clearly out of trend. However, it is often the case that those anomalies are don't represent errors and are a normal part of the process. In order to alert on the anomalies that represent an error or a degradation of the system and filter out the spike anomalies, some feature engineering has to be done. Feature extraction This represent the most important part of the analysis. Either you use knowledge of the experts, intuition of literatures (especially for bearing and rotating machines). Or you perform an automated feature extraction using packages such as : HTCSA (highly comparative time-series analysis) is a library implementing more than 7000 features (use pyopy for Python on Linux and OSX). It allows to normalize and clster the data, produce low dimensional representation of the data, identify and discriminate features between different classes of time series, learn multivariate classification models, vizualise the data, etc. Catch22 reduces the 7000 features coded in HTCSA to the 22 that produced the best results across 93 real world time-series datasets. tsfresh is a package that automatically calculates a large number of time series characteristics and contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks A combinaison of both automatically extracted knowledge and human knowledge can be combined. For instance, you can filter the spikes with a rolling median and then use catch22 on the resulting data. Or you can in parallel use your knowledge about bearing degradation and some automatically extracted feature. Unsupervised Anomaly Detection algorithms When you are using unsupervised anomaly detection algorithm you postulate that the majority is normal and you try to find outliers. Those outliers are the anomalies. This approach is useful when you only have unlabeled data. Algorithms used in this case are often : nearest neighbor / density based : Global : K-Nearest Neighbor (K-NN), DBSCAN Local : Local Outlier Factor (LOF) Clustering based: Global : Cluster Based Local Outlier Factor (CBLOF/uCBLOF) Local : Local Density Cluster-based Outlier Factor (LDCOF) The tricky part is to set k, the number of clusters and the other hyperparameters. Furthermore, this kind of alogrithms perform poorly against persitant changes because the normal and anormal states would be in two clusters but they would be identified as normal by the algorithm since they represent the majority of the data. Semi-supervised Anomaly Detection algorithms The first approach is to train the algorithm on healthy data and detect an anomaly when the distance between the measured point and the healthy cluster exceeds a value. * Distance based measures to healthy states such as the measure of the Mahalanobis distance You can also model the surface of the healthy state and detect an anomaly when the measure crosses the surface : Rich Representation of Healthy State: One-class Support Vector Machines (SVM) One-class Neuronal Networks Finally you can perform a dimension reduction of the space by finding new basis function of the state, and keeping only the n most important feature vector. An anomaly is detected when the reconstruction error grows because it is not part of what is considered normal. 
Reconstruction Error with Basis Functions : Principal Component Analysis (PCA) Neuronal Network (Autoencoders) Very important : Do not use dimensionality reduction (like PCA) before the anomaly detection because you would throw away all the anomalies. This kind of semi supervised approach is strongly dependent on the data. Hence if you don't have a healthy state in the training set then the output of the algorithm won't be useful. Supervised anomaly detection algorithm Here, you apply classical classification methods for machine learning. However, be careful when training your classifiers because you have very imbalanced classes. Conclusions Anomalies may or may not be harmful! Hence you have to focus on the one that can damage your system. Anomaly interpretation depend a lot on the context (spike, progressive change, persitent change) Questions for feature extraction (collective, contextual or point like): which external influence ? which kind of events should be detected ? Questions for choice of algorithm : Does data have labelled events ? -> Supervised learning Is healthy state marked ? -> Semi Supervised If no knowledge at all -> Unsupervised Questions for model deployment When is information needed (real-time vs historic)?","tags":"Blog","url":"redoules.github.io/blog/Time_Series_anomaly_detection.html","loc":"redoules.github.io/blog/Time_Series_anomaly_detection.html"},{"title":"Showing the training progress","text":"Let's see how we can show the progress and various metrics during the training process interactively in the console or the notebook. Let's first import some libraries from tensorflow import keras import numpy as np from tqdm import tqdm In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting in 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Let's define a simple feedforward network. ##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" , 'mae' ]) We now need to define the callback by specifiying to tqdm how to show the training progress. The callback has to be added to the callbacks list in the fit method. with tqdm ( total = 10 , unit = \"epoch\" ) as t : def cbk ( epoch , logs ): t . set_postfix ( logs , refresh = False ) t . update () cbkWrapped = keras . callbacks . LambdaCallback ( on_epoch_end = cbk ) model . fit ( train_images , train_labels , epochs = t . 
total , verbose = 0 , callbacks = [ cbkWrapped ]) 80%|████████ | 8/10 [00:38<00:09, 4.76s/epoch, loss=0.257, acc=0.904, mean_absolute_error=4.42]","tags":"DL","url":"redoules.github.io/dl/tqdm_keras.html","loc":"redoules.github.io/dl/tqdm_keras.html"},{"title":"Showing progress","text":"It is a small and very popular progress meter. It is easy to use, fast (<100ns per iteration overhead), has an intelligent estimated time remaining, works nearly everywhere and is dependency-free. A minimal example is : from tqdm import tqdm from time import sleep for i in tqdm ( range ( 100 )): sleep ( 0.01 ) 100%|██████████| 100/100 [00:01<00:00, 98.57it/s] It works basically everywhere : * in the notebook * in the console * on MacOS * on Linux * on Windows The trange(N) is a convinience function that can be used as a shortcut for tqdm(range(N)) from tqdm import trange from time import sleep for i in trange ( 100 ): sleep ( 0.01 ) 100%|██████████| 100/100 [00:01<00:00, 98.31it/s] You can add a description and units to the progress bar from tqdm import trange from time import sleep for i in trange ( 100 , desc = 'my progress bar' , unit = \"epoch\" ): sleep ( 0.01 ) my progress bar: 100%|██████████| 100/100 [00:01<00:00, 98.77epoch/s] tqdm can be used outside of python, for example in a pipe in order to show the progress ! seq 999999 | python - m tqdm | wc - l 999999it [00:00, 2908577.61it/s] 999999 or the number of bytes per second ! seq 999999 | python - m tqdm -- bytes | wc - l 6.57MB [00:00, 232MB/s] 999999 we can also have a progress bar if we specify a total ! seq 999999 | python - m tqdm -- bytes -- total 7628000 | wc - l 90%|███████████████████████████████████▏ | 6.57M/7.27M [00:00<00:00, 225MB/s] 999999 Iterable-based use Wrap tqdm() around any iterable (i.e. list, numpy array, pandas dataframe, etc.) text = \"\" for c in tqdm ([ \"a\" , \"b\" , \"c\" , \"d\" ]): sleep ( 0.25 ) text = text + c 100%|██████████| 4/4 [00:01<00:00, 3.98it/s] The progress bar can be instantiated outside of the loop pbar = tqdm ([ \"a\" , \"b\" , \"c\" , \"d\" ]) for c in pbar : sleep ( 0.25 ) pbar . set_description ( f \"Processing { c } \" ) Proc essi ng d : 100 %| ██████████ | 4 / 4 [ 00 : 01 < 00 : 00 , 3.98 it / s ] Manual tqdm can be manually controled by using a with statement. If you specify a total (or an iterable with len() ), predictive stats are displayed. with tqdm ( total = 100 ) as pbar : for i in range ( 10 ): sleep ( 0.1 ) pbar . update ( 10 ) 100%|██████████| 100/100 [00:01<00:00, 99.08it/s] Desciption and additional stats Custom information can be displayed and updated dynamically on tqdm bars with the desc and postfix arguments. This can be useful for machine learning where we want to print the metrics or losses during the training process. from tqdm import trange from random import random , randint from time import sleep with trange ( 10 ) as t : for i in t : t . set_description ( f \"GEN { i } \" ) t . set_postfix ( loss = random (), gen = randint ( 1 , 999 ), str = \"h\" , lst = [ 1 , 2 ]) sleep ( 0.1 ) GEN 9: 100%|██████████| 10/10 [00:01<00:00, 9.80it/s, gen=927, loss=0.505, lst=[1, 2], str=h] You can customise what your bar looks like with the bar_format option. with tqdm ( total = 10 , bar_format = \" {postfix[0]} {postfix[1][value]:>8.2g} \" , postfix = [ \"Batch\" , dict ( value = 0 )]) as t : for i in range ( 10 ): sleep ( 0.1 ) t . postfix [ 1 ][ \"value\" ] = i / 2 t . update () Batch 4.5 Hooks and callbacks tqdm can be integrated with other libaries. 
In the example, we integrate tqdm with urllib In order to download a file in python we use the following code but it doesn't show any progress. import urllib.request , os eg_link = \"http://mirrors.melbourne.co.uk/ubuntu-releases/19.10/ubuntu-19.10-desktop-amd64.iso\" urllib . request . urlretrieve ( eg_link , filename = os . devnull , data = None ) We can create a class called TqdmUpTo to show the progress. It is recommended to use miniters=1 whenever there is a potentially large difference in iteration speed (e.g. downloading a file over a patchy connection). tqdm expect a call to update and urllib needs an update_to method class TqdmUpTo ( tqdm ): def update_to ( self , blocks_so_far = 1 , block_size = 1 , total = None ): if total is not None : self . total = total self . update ( blocks_so_far * block_size - self . n ) with TqdmUpTo ( unit = 'B' , unit_scale = True , miniters = 1 , desc = eg_link . split ( \"/\" )[ - 1 ]) as t : urllib . request . urlretrieve ( eg_link , filename = os . devnull , data = None , reporthook = t . update_to ) ubuntu-19.10-desktop-amd64.iso: 0%| | 4.38M/2.46G [00:06<56:58, 720kB/s] The hooks can be useful for dispalying the progress of training a neural network with keras with tqdm ( total = 10 , unit = \"epoch\" ) as t : def cbk ( epoch , logs ): t . set_postfix ( logs , refresh = False ) t . update () cbkWrapped = keras . callbacks . LambdaCallback ( on_epoch_end = cbk ) model . fit ( x , y , epochs = t . total , verbose = 0 , callbacks = [ cbkWrapped ]) 0 %| | 0 / 10 [ 00 : 00 <? , ? epoch / s ] tqdm can also be applyied to pandas import pandas as pd import numpy as np from tqdm import tqdm df = pd . DataFrame ( np . random . rand ( 5 , 10 )) #Registed `pandas.progress_apply`, `pandas.Series.map_apply`, etc with tqdm #tqdm_gui, tqdm_notebook, optional kwargs can be used tqdm . pandas ( desc = \"my bar!\" ) #replace apply by progress_apply or map by progress_map df . progress_apply ( lambda x : x ** 2 ) / home / guillaume / anaconda3 / lib / python3 . 7 / site - packages / tqdm / std . py : 654 : FutureWarning : The Panel class is removed from pandas. Accessing it from the top - level namespace will also be removed in the next version from pandas import Panel my bar ! : 100 %| ██████████ | 10 / 10 [ 00 : 00 < 00 : 00 , 818.46 it / s ] 0 1 2 3 4 5 6 7 8 9 0 0.072067 0.192319 0.354137 0.226678 0.092123 0.003836 0.190529 0.129519 0.272106 0.918287 1 0.048068 0.157399 0.047842 0.024008 0.029614 0.067597 0.891104 0.909314 0.000385 0.090840 2 0.965366 0.000364 0.159109 0.550288 0.494446 0.157180 0.113372 0.427566 0.086719 0.004057 3 0.323191 0.524031 0.000673 0.037876 0.274079 0.036120 0.479730 0.095726 0.085360 0.006432 4 0.707128 0.041636 0.339444 0.105327 0.111249 0.331656 0.172311 0.940055 0.003604 0.463608 Notebook integration Use tnrange via the tqdm_notebook submodule ```python from tqdm import tnrange, tqdm_notebook from time import sleep for i in tnrange(3, desc=\"1st loop\"): for j in tqdm_notebook(range(100), desc=\"2nd loop\"): sleep(0.01) If you are not sure if your users are using a notebook or a console you can use tqdm.auto ```python from tqdm.auto import tqdm tqdm.pandas()","tags":"Python","url":"redoules.github.io/python/tqdm.html","loc":"redoules.github.io/python/tqdm.html"},{"title":"Introduction to dask DataFrames","text":"Dask arrays extend the pandas interface to work on larger than memory datasets on a single machine or distributed datasets on a cluster of machines. It reuses a lot of the pandas code but extends it to larger scales. 
Start with pandas To see how that works, we start with pandas in order to show a bit later how similar the interfaces look. import pandas as pd df = pd . read_csv ( \"data/1.csv\" ) df . head () index X Y Z value 0 10000 0.613648 0.514523 0.675306 0.997480 1 10001 0.785925 0.418075 0.558356 0.435089 2 10002 0.382117 0.841691 0.263298 0.120973 3 10003 0.374417 0.534436 0.093729 0.104052 4 10004 0.061580 0.404272 0.826618 0.980229 Once the data is loaded, we can work on it pretty easily. For instance we can take the mean of the value column and get the result instantly. df . value . mean () 0.5004419057234618 When we want to operate on many files or if the size of the dataset is larger than memory pandas breaks down. Read all CSV files lazily with Dask DataFrames Intead of using pandas we will use the dask DataFrame to load the csv. import dask.dataframe as dd df = dd . read_csv ( \"data/1.csv\" ) df Dask DataFrame Structure: index X Y Z value npartitions=1 int64 float64 float64 float64 float64 ... ... ... ... ... Dask Name: from-delayed, 3 tasks As you can see, the dask DataFrame didn't return any data. If we want some data we can use the head function. df . head () index X Y Z value 0 10000 0.613648 0.514523 0.675306 0.997480 1 10001 0.785925 0.418075 0.558356 0.435089 2 10002 0.382117 0.841691 0.263298 0.120973 3 10003 0.374417 0.534436 0.093729 0.104052 4 10004 0.061580 0.404272 0.826618 0.980229 Like previously, we can compute the mean of the value column df . value . mean () dd.Scalar<series-..., dtype=float64> Notice that we didn't get a full result. Indeed the dask DataFrame like every Dask objects is lazy by default. You have to use the compute function to get the result. df . value . mean () . compute () 0.5004419057234627 Another advantage of Dask DataFrames is that we can work on multiple files instead of a file at once. df = dd . read_csv ( \"data/*.csv\" ) df Dask DataFrame Structure: index X Y Z value npartitions=64 int64 float64 float64 float64 float64 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Dask Name: from-delayed, 192 tasks df . value . mean () . compute () 0.5005328752185645 Index, partitions and sorting Every Dask DataFrames is composed of many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines. All those partitions are loaded in parallel. df Dask DataFrame Structure: index X Y Z value npartitions=64 int64 float64 float64 float64 float64 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Dask Name: from-delayed, 192 tasks When we look at the structure of the Dask Dataframe, we see that is is composed of 192 python functions that must be run in order to run the dask dataframe. Each of this partitions is a pandas DataFrame type ( df . partitions [ 3 ] . compute ()) pandas.core.frame.DataFrame We can write a function mapped to all the partitions of the dask dataframe in order to see that we have 64 partitions. Each of them is a pandas DataFrame. df . map_partitions ( type ) . compute () 0 <class 'pandas.core.frame.DataFrame'> 1 <class 'pandas.core.frame.DataFrame'> 2 <class 'pandas.core.frame.DataFrame'> 3 <class 'pandas.core.frame.DataFrame'> 4 <class 'pandas.core.frame.DataFrame'> ... 
59 <class 'pandas.core.frame.DataFrame'> 60 <class 'pandas.core.frame.DataFrame'> 61 <class 'pandas.core.frame.DataFrame'> 62 <class 'pandas.core.frame.DataFrame'> 63 <class 'pandas.core.frame.DataFrame'> Length: 64, dtype: object In the df dataframe, we notice that there is a column of unique values called index. We will use this as the index of the dataframe. df = df . set_index ( \"index\" ) df Dask DataFrame Structure: X Y Z value npartitions=64 0 float64 float64 float64 float64 9624 ... ... ... ... ... ... ... ... ... 629999 ... ... ... ... 639999 ... ... ... ... Dask Name: sort_index, 642 tasks This operation requires to load all the data in order to find the minimal and maximal values of this column. Thanks to this operation, if we want to get some data contained between two indices dask will know in which file to find the data and won't have to reload all the files. Write the data to Parquet Parquet is a columnar file format and is tightly integrated with both dask and pandas. You can you the to_parquet function to export the dataframe to a parquet file. df . to_parquet ( \"data/data.parquet\" )","tags":"Python","url":"redoules.github.io/python/dask_dataframes.html","loc":"redoules.github.io/python/dask_dataframes.html"},{"title":"Introduction to dask arrays","text":"Dask arrays extend the numpy interface to larger than memory and parallel workflows across a distributed cluster. They look and feel a lot like numpy and use numpy under the hood. Indeed Dask arrays coordinate many NumPy arrays arranged into a grid. These NumPy arrays may live on disk or on other machines. You can create a dask array like following. We create an array filled with 1 of lenght 15. We have to specify a chunk size. import dask.array as da x = da . ones ( 15 , chunks = ( 5 ,)) x Array Chunk Bytes 120 B 40 B Shape (15,) (5,) Count 3 Tasks 3 Chunks Type float64 numpy.ndarray 15 1 The output is a dask array composed of 3 numpy arrays of size 5 each. If we try to compute the sum of all the elements of the array we won't get the result by using the sum method. Indeed, Dask objects are lazy by default and run the computations only when instructed. x . sum () Array Chunk Bytes 8 B 8 B Shape () () Count 7 Tasks 1 Chunks Type float64 numpy.ndarray We have to call compute at the end of each operation if we want the result. x . sum () . compute () 15.0 The above example is pretty trival. Let's say now that we have a 10000 by 10000 array. We choose to represente that array by chunks of 1000 by 1000. x = da . random . random (( 10000 , 10000 ), chunks = ( 1000 , 1000 )) x Array Chunk Bytes 800.00 MB 8.00 MB Shape (10000, 10000) (1000, 1000) Count 100 Tasks 100 Chunks Type float64 numpy.ndarray 10000 10000 Dask has created a 10 by 10 grid where each element of that list is a 1000 by 1000 numpy array. Again, we can do operations on that dataset that are very similar to the way we would do it with numpy. First let's add the array to its transpose y = x + x . T Then slice the array and take the mean. z = y [:: 2 , :] . mean ( axis = 1 ) z Array Chunk Bytes 40.00 kB 4.00 kB Shape (5000,) (500,) Count 540 Tasks 10 Chunks Type float64 numpy.ndarray 5000 1 When we want to get access to the result, we just need to use the compute method and dask will compute the result in parallel on the different cores of the machine. z . compute () array([0.99874649, 0.99722168, 0.99725464, ..., 1.00849801, 1.00448204, 0.99683664]) In practice, dask is often used in tandem with data file formats like HDF5, zar or netcdf. 
In this situation you might load a file from disk and use the from_array function. import h5py f = h5py . File ( \"myfile.hdf5\" ) d = f [ \"/data/path\" ] d . shape ( 10000000 , 1000000 ) import dask.array as da x = da . from_array ( d , chunks = ( 10000 , 10000 )) x . mean ( axis = 0 ) . compute ()","tags":"Python","url":"redoules.github.io/python/dask_array.html","loc":"redoules.github.io/python/dask_array.html"},{"title":"Compute the intersection of 2 numpy arrays","text":"In this article we will use sets to compute the intersection of 2 numpy arrays import numpy as np array2a = np . array ([[ 1 , 2 ], [ 3 , 3 ], [ 2 , 1 ], [ 1 , 3 ], [ 2 , 1 ]]) array2b = np . array ([[ 2 , 1 ], [ 1 , 4 ], [ 3 , 3 ]]) a = set (( tuple ( i ) for i in array2a )) b = set (( tuple ( i ) for i in array2b )) a . intersection ( b ) # {(2, 1), (3, 3)} {(2, 1), (3, 3)}","tags":"Python","url":"redoules.github.io/python/intersection_array.html","loc":"redoules.github.io/python/intersection_array.html"},{"title":"Get min and max distance withing a point cloud","text":"In this article we will see how to filter all the point whose distance to any point in an ensemble of points is greater than a specified value. For example, we have two set of points : * the source in orange * the target in blue And we want to find all the points in the target ensemble that are at most at a distance of 0.5 of any point in the source distribution. % matplotlib inline import matplotlib.pyplot as plt import numpy as np target = np . random . normal ( 0 , 1 , ( 100 , 2 )) source = np . random . random (( 100 , 2 )) plt . scatter ( source [:, 0 ], source [:, 1 ], color = \"orange\" ) plt . scatter ( target [:, 0 ], target [:, 1 ], color = \"blue\" ) plt . show () In order to do so, we will use the cdist function form the scipy.spatial.distance package. This function computes the distance between each pair of the two collections of points. We compute the minimum distance form all the points in the target ensemble from any point in the source ensemble. dist = cdist ( source , target ) . min ( axis = 0 ) Once the distance has been computed, we filter out all the points that have more distant that the threshold value. dist [ dist > thres ] = False dist [ dist != False ] = True after that, we only need to format the result array and filter out all the zero values. a = np . array ([ target [:, 0 ] * dist , target [:, 1 ] * dist ]) . T return a [ ~ ( a == 0 ) . all ( 1 )] from scipy.spatial.distance import cdist def filter_too_far ( target , source , thres = 1 ): \"\"\" Filters out all the points in the target array whose distance to any point in the source array is greater than the threshold value This function is made for 2D points \"\"\" dist = cdist ( source , target ) . min ( axis = 0 ) dist [ dist > thres ] = False dist [ dist != False ] = True a = np . array ([ target [:, 0 ] * dist , target [:, 1 ] * dist ]) . T return a [ ~ ( a == 0 ) . all ( 1 )] Here we have in green all the point in the target ensemble that are distant from any point in the source ensemble of at most 0.5 units filtered = filter_too_far ( target , source , thres = 0.5 ) plt . scatter ( source [:, 0 ], source [:, 1 ], color = \"orange\" ) plt . scatter ( target [:, 0 ], target [:, 1 ], color = \"blue\" , marker = \"+\" ) plt . 
scatter ( filtered [:, 0 ], filtered [:, 1 ], color = \"green\" , marker = \"+\" ) <matplotlib.collections.PathCollection at 0x7fde375a4490> This function can easily be adapted to work for 3D-points from scipy.spatial.distance import cdist def filter_too_far ( target , source , thres = 1 ): \"\"\" Filters out all the points in the target array whose distance to any point in the source array is greater than the threshold value This function is made for 3D points \"\"\" dist = cdist ( source , target ) . min ( axis = 0 ) dist [ dist > thres ] = False dist [ dist != False ] = True a = np . array ([ target [:, 0 ] * dist , target [:, 1 ] * dist , target [:, 2 ] * dist ]) . T return a [ ~ ( a == 0 ) . all ( 1 )] Note that you are not restricted to the euclidian distance, the cdist function can use the following distances: * braycurtis * canberra * chebyshev * cityblock * correlation * cosine * dice * euclidean * hamming * jaccard * jensenshannon * kulsinski * mahalanobis * matching * minkowski * rogerstanimoto * russellrao * euclidean * sokalmichener * sokalsneath * sqeuclidean * wminkowski * yule","tags":"Python","url":"redoules.github.io/python/points_too_far_away.html","loc":"redoules.github.io/python/points_too_far_away.html"},{"title":"Open an image with PIL","text":"Python has a library called PIL (short for Python Image Library). With openCV, it provides a very useful set of objects to manipulate image data. In order to read the content of an image, we will create an Image object from PIL import Image We now need to specify the image filename as an argument of the open function img = Image . open ( \"../images/load_image_PIL/myimage.png\" ) If you are in the notebook, you can display the image by calling the object you just created img If you pass the image object to the constructor of a Numpy array, the values of the image will be stored in that array import numpy as np np . 
array ( img ) array ([[[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]], [[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]], [[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]], ..., [[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]], [[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]], [[ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], ..., [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ], [ 255 , 255 , 255 , 255 ]]], dtype = uint8 )","tags":"Python","url":"redoules.github.io/python/load_image_PIL.html","loc":"redoules.github.io/python/load_image_PIL.html"},{"title":"Recursively compute md5 checksum","text":"The md5 checksum for all files in a folder can be recursively computed then stored in a text file called md5sum.txt with the command :* cd /you/folder/of/interest find -type f \\( -not -name \"md5sum.txt\" \\) -exec md5sum '{}' \\; > md5sum.txt","tags":"Linux","url":"redoules.github.io/linux/md5_recursif.html","loc":"redoules.github.io/linux/md5_recursif.html"},{"title":"Find the file owner","text":"You can find the owner of a file by running the following command. The command will return the owner of the file and the domain import win32api import win32con import win32security def owner ( file ): sd = win32security . GetFileSecurity ( file , win32security . OWNER_SECURITY_INFORMATION ) owner_sid = sd . GetSecurityDescriptorOwner () name , domain , type = win32security . LookupAccountSid ( None , owner_sid ) return ( name , domain ) filename = \"my.file\" print ( f \"The owner of the file { filename } is { owner ( filename )[ 0 ] } \" ) The owner of the file my.file is my.user","tags":"Python","url":"redoules.github.io/python/file_owner.html","loc":"redoules.github.io/python/file_owner.html"},{"title":"File creation date in Windows","text":"You can find the date of creating of a file by running the following command import os import time def creation_date ( path_to_file ): return time . strftime ( '%Y-%m- %d %H-%M-%S' , time . localtime ( os . path . getctime ( path_to_file ))) creation_date ( \"my.file\" ) '2019-11-04 14-35-54'","tags":"Python","url":"redoules.github.io/python/file_creation_date.html","loc":"redoules.github.io/python/file_creation_date.html"},{"title":"Get min and max distance withing a point cloud","text":"Here we will learn how to find the maximal or minimal distance between two points in a cloud of points. To do so, we will use the pdist function available in the scipy.spatial.distance package. This function computes the pairwise distances between observations in n-dimensional space; in order to find the longest or shortest distance, juste take the max or min. import numpy as np from scipy.spatial.distance import pdist points = np . random . random (( 10 , 3 )) #generate 100 points pairwise = pdist ( points ) # compute the pairwise distance between those points #compute the maximal and minimal distance print ( f \"maximal distance : { np . 
max ( pairwise ) } \" ) print ( f \"minimal distance : { np . min ( pairwise ) } \" ) maximal distance : 1.1393617436726384 minimal distance : 0.2382615513731064 The pdist function can that different metric for the distance computation. The default metrics are : * braycurtis * canberra * chebyshev * cityblock * correlation * cosine * dice * euclidean * hamming * jaccard * jensenshannon * kulsinski * mahalanobis * matching * minkowski * rogerstanimoto * russellrao * seuclidean * sokalmichener * sokalsneath * sqeuclidean * yule You can also define your own distances with a lambda function np . max ( pdist ( points , lambda u , v : np . sqrt ((( u - v ) ** 2 ) . sum ()))) 1.1393617436726384 or with a classical function def dfun ( u , v ): return np . sqrt ((( u - v ) ** 2 ) . sum ()) np . max ( pdist ( points , dfun )) 1.1393617436726384","tags":"Python","url":"redoules.github.io/python/point_cloud_distance.html","loc":"redoules.github.io/python/point_cloud_distance.html"},{"title":"Natural sort of list","text":"Natural sort order is an ordering of strings in alphabetical order, except that multi-digit numbers are ordered as a single character. Natural sort order has been promoted as being more human-friendly (\"natural\") than the machine-oriented pure alphabetical order. For example, in alphabetical sorting \"z11\" would be sorted before \"z2\" because \"1\" is sorted as smaller than \"2\", while in natural sorting \"z2\" is sorted before \"z11\" because \"2\" is sorted as smaller than \"11\". def natural_sort ( l ): \"\"\" return the list l in a natural sort order \"\"\" convert = lambda text : int ( text ) if text . isdigit () else text . lower () alphanum_key = lambda key : [ convert ( c ) for c in re . split ( \"([0-9]+)\" , key )] return sorted ( l , key = alphanum_key )","tags":"Python","url":"redoules.github.io/python/natural_sort.html","loc":"redoules.github.io/python/natural_sort.html"},{"title":"Save a numpy array to disk","text":"In this article we will learn how to save a numpy array to the disk. We will then see how to load it back from the disk into memory. First, let't import numpy. # Import modules import numpy as np We will generate an array to demonstrate saving and loading. myarray = np . arange ( 10 ) Numpy arrays can be save to the disk to the binary .npy format by using the save method. np . save ( \"C: \\\\ temp \\\\ arr.npy\" , myarray ) Once saved, it can be retrived from the disk by using the load method. my_other_array = np . load ( \"C: \\\\ temp \\\\ arr.npy\" ) my_other_array array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])","tags":"Python","url":"redoules.github.io/python/numpy_save.html","loc":"redoules.github.io/python/numpy_save.html"},{"title":"Setting up MariaDB for Remote Client Access","text":"Some MariaDB packages bind MariaDB to 127.0.0.1 (the loopback IP address) by default as a security measure using the bind-address configuration directive, in that case, one can't connect to the MariaDB server from other hosts or from the same host over TCP/IP on a different interface than the loopback (127.0.0.1). 
The list of existing remote users can be accessed with the following SQL statement on the mysql.user table: SELECT User, Host FROM mysql.user WHERE Host <> 'localhost'; +-----------+-----------+ | User | Host | +-----------+-----------+ | Guillaume | % | | root | 127.0.0.1 | | root | ::1 | +-----------+-----------+ 4 rows in set (0.00 sec) We will create a \"root\" user that can connect from anywhere within the local area network (LAN), which has addresses in the subnet 192.168.1.0/24. This is an improvement because opening a MariaDB server up to the Internet and granting access to all hosts is bad practice. GRANT ALL PRIVILEGES ON *.* TO 'root'@'192.168.1.%' IDENTIFIED BY 'my-new-password' WITH GRANT OPTION; (% is a wildcard)","tags":"SQL","url":"redoules.github.io/sql/remote_access.html","loc":"redoules.github.io/sql/remote_access.html"},{"title":"Log experiments","text":"Machine learning is a very iterative process: algorithms have multiple hyperparameters to keep track of, and the performance of the models evolves as you get more data. In order to manage the model lifecycle, we will use mlflow. First, import mlflow import mlflow Mlflow can be run on the local computer in order to try it out but I recommend deploying it on a server. In our case, the server is located on the local network at 192.168.1.5:4444. The mlflow client can connect to it via the set_tracking_uri method mlflow . set_tracking_uri ( \"http://192.168.1.5:4444\" ) Mlflow can be used to record and query experiments : code, data, config, results... Let's specify that we are working on my-experiment with the method set_experiment . If the experiment does not exist, it will be created. mlflow . set_experiment ( \"my-experiment\" ) mlflow . log_param ( \"num_dimensions\" , 8 ) mlflow . log_param ( \"regularization\" , 0.1 ) Metrics can be logged as well in mlflow, just use the log_metric method. mlflow . log_metric ( \"accuracy\" , 0.1 ) mlflow . log_metric ( \"accuracy\" , 0.45 ) Metrics can be updated at a later time. The changes will be tracked across versions. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs. Using the web UI, you can view and compare the output of multiple runs. Teams can also use the tools to compare results from different users:","tags":"Machine Learning","url":"redoules.github.io/machine-learning/Log_experiments_mlflow.html","loc":"redoules.github.io/machine-learning/Log_experiments_mlflow.html"},{"title":"Filter or select lines of a DataFrame containing values in a list","text":"In this article we will learn to filter the lines of a dataframe based on the values contained in a column of that dataframe. This is similar to the \"Filter\" functionality of Excel. Let's first create our dataframe : # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . 
DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If we want to extract all the lines where the value of the color column is yellow, we would proceed like so : df [ df [ \"color\" ] == \"yellow\" ] fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 Now, if we want to filter the DataFrame by a list of values we would rather use the isin method like this : df [ df [ \"color\" ] . isin ([ \"yellow\" , \"red\" ])] fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15","tags":"Python","url":"redoules.github.io/python/select_lines_values_list.html","loc":"redoules.github.io/python/select_lines_values_list.html"},{"title":"List all sections in a config file","text":"A config file is partionned in sections.Here is an examples of a config file named config.ini : [section1] var_a:hello var_b:world [section2] myvariable: 42 There are two sections in this config file, you can access to them in python by calling the sections method of the ConfigParser class import configparser config = configparser . ConfigParser () config . read ( \"config.ini\" ) config . sections () ['section1', 'section2']","tags":"Python","url":"redoules.github.io/python/config_list.html","loc":"redoules.github.io/python/config_list.html"},{"title":"Getting traffic data from google maps","text":"Goal of the project We will scrap google maps in order to find the travel time from a grid of points to a couple of destinations. This way, we will find the most optimal points to minimize both journeys. This code can be used to pinpoint the best locations to pick a home when two people are working at different locations. By scrapping google maps, we can take into account how the traffic impacts the travel time. You can download the project by going to the GitHub repository Scrapping google maps Since google maps is a dynamic website, we cannot use simple tools such as wget or curl. Even webparsers such as scrappy don't render the DOM hence cannot work in this situation. The easiest way to scrap data from such websites is to take control of a browser by using an automation tool. In this case we will use selenium to take control of Google Chrome with the chromedriver. You have to install selenium with conda install -c conda-forge selenium or pip install selenium you also need to have the chromedriver.exe downloaded. BeautifulSoup is a package we will use to parse the html of the webpage opened in chrome. In order to extract the estimated travel time, we need to inspect the source code of the page in find the element we are interested in. In our case it is section-directions-trip-numbers . In this <div> element we will then get the estimated value contained in the <span> element The code First, let's import selenium, beautiful soup and some other libraries # Selenium allows to control chrome programmatically from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.chrome.options import Options #beautifulsoup is used to parse the dom of the html page import bs4 as BeautifulSoup import numpy as np import pandas as pd import os We will also need some extra libraries for plotting the results import matplotlib.pyplot as plt from matplotlib.transforms import offset_copy import cartopy.crs as ccrs import cartopy.io.img_tiles as cimgt Let's define the GPS coordinates of the two destinations we are interested in. 
The coordinates can be found the the URL of a google maps search longitudeDestination1 = 48.9361537 latitudeDestination1 = 2.2507129 longitudeDestination2 = 48.7783875 latitudeDestination2 = 2.1803534 We will search on an equally spaced grid of point starting from (long_begin, lat_begin) and going to (long_end, lat_end). In order to do so, we will : * construct the URL from the GPS coordinates * load the url in chrome with driver.get * read the resulting html with driver.page_source * parse the html with beautiful soup in order to find the first <div> element with the class section-directions-trip-numbers * in this element, we will get the estimated travel time by reading the text value of the second <span> element def get_travel_time ( url , driver ): \"\"\" get the estimated travel time of the google maps given as url \"\"\" resultats = None driver . get ( url ) while resultats == None : soupe = BeautifulSoup . BeautifulSoup ( driver . page_source , \"lxml\" ) soupe . select ( \"section-directions-trip-numbers\" ) resultats = soupe . find ( 'div' , attrs = { \"class\" : u \"section-directions-trip-numbers\" }) return resultats . find_all ( \"span\" )[ 2 ] Once the function is defined, we only need to call it in a loop in order to get all the point of the grid chrome_options = Options () #chrome_options.add_argument(\"--disable-extensions\") #chrome_options.add_argument(\"--disable-gpu\") chrome_options . add_argument ( \"--headless\" ) #make chrome headless. If you want to see the autimation, comment this line driver = webdriver . Chrome ( executable_path = '. \\\\ chromedriver.exe' , chrome_options = chrome_options ) nb = 10 ctn = 0 time = [] for coordX in np . linspace ( long_begin , long_end , nb ): for coordY in np . linspace ( lat_begin , lat_end , nb ): url_journey1 = f \"https://www.google.com/maps/dir/ { coordX } , { coordY } /@ { longitudeDestination1 } , { latitudeDestination1 } ,12z/data=!3m1!4b1!4m14!4m13!1m0!1m5!1m1!1s0x47e67bff078f6575:0x95df2619f9304bd7!2m2!1d2.1825421!2d48.778384!2m4!2b1!6e0!7e2!8j1570521600!3e0\" url_journey2 = f 'https://www.google.com/maps/dir/ { coordX } , { coordY } /@ { longitudeDestination2 } , { latitudeDestination2 } ,14z/data=!3m1!4b1!4m14!4m13!1m0!1m5!1m1!1s0x47e665df0cb0b919:0x5f513cdf2fe6d39d!2m2!1d2.2572779!2d48.9368666!2m4!2b1!6e0!7e2!8j1570521600!3e0' temps_user1 = get_travel_time ( url_journey1 , driver ) temps_user2 = get_travel_time ( url_journey2 , driver ) ctn += 1 print ( f 'Downloaded : { ctn / ( nb * nb ) * 100 } %' ) time . append ([ coordX , coordY , temps_user1 . text , temps_user2 . text , f 'https://www.google.com/maps/place/ { coordX } , { coordY } ' ]) Downloaded : 1.0 % Downloaded : 2.0 % Downloaded : 3.0 % Downloaded : 4.0 % [...] Downloaded : 96.0 % Downloaded : 97.0 % Downloaded : 98.0 % Downloaded : 99.0 % Downloaded : 100.0 % After gathering the results, the values stored in the time list are string and cannot be interpreted as numerical values without a post processing. This is why I've written the function analyse_time in order to split the text and convert it to a numerical format expressed in minutes. def analyse_time ( time ): \"\"\" Analyse the time given by google maps, splits the lower and higher estimate and converts them to minutes \"\"\" tlow = time . split ( \" - \" )[ 0 ] . replace ( \" \\xa0 \" , \" \" ) thigh = time . split ( \" - \" )[ 1 ] . replace ( \" \\xa0 \" , \" \" ) if ( \"min\" not in tlow ) and ( \"h\" not in tlow ): #example : 26 tlow = int ( tlow . 
replace ( \" \" , \"\" )) elif \"h\" not in tlow : # example 26 min tlow = tlow . replace ( \"min\" , \"\" ) tlow = int ( tlow . replace ( \" \" , \"\" )) else : if \"min\" in tlow : #example 1h 26min tlow = tlow . split ( \"h\" ) tlow = 60 * int ( tlow [ 0 ] . replace ( \" \" , \"\" )) + int ( tlow [ 1 ] . replace ( \"min\" , \"\" ) . replace ( \" \" , \"\" )) else : #example 1h tlow = 60 * int ( tlow . replace ( \"h\" , \"\" )) if \"h\" not in thigh : thigh = thigh . replace ( \"min\" , \"\" ) thigh = int ( thigh . replace ( \" \" , \"\" )) else : if \"min\" in thigh : thigh = thigh . split ( \"h\" ) thigh = 60 * int ( thigh [ 0 ] . replace ( \" \" , \"\" )) + int ( thigh [ 1 ] . replace ( \"min\" , \"\" ) . replace ( \" \" , \"\" )) else : thigh = 60 * int ( thigh . replace ( \"h\" , \"\" )) return ( tlow , thigh ) For every result previously gathered, let's apply the function analyse_time then put it in a pandas dataframe. While we are at it, I also computed the geometric mean of the minimum time estimated for both users as well of the maximum time. A geometric mean is interesting in this interesting here because we want to avoid have one user doing a long journey while the other is doing a short one. df = [] for t in time : lat = t [ 0 ] lon = t [ 1 ] t1 = analyse_time ( t [ 2 ]) t2 = analyse_time ( t [ 3 ]) geomlow = np . sqrt ( t1 [ 0 ] * t2 [ 0 ]) #geometric mean geomhigh = np . sqrt ( t1 [ 1 ] * t2 [ 1 ]) #geometric mean df . append ([ lat , lon , geomlow , geomhigh , t1 [ 0 ], t1 [ 1 ], t2 [ 0 ], t2 [ 1 ]]) traveltime = pd . DataFrame ( df , columns = [ \"latitude\" , \"longitude\" , \"geometric mean low\" , \"geometric mean high\" , \"time low 1\" , \"time high 1\" , \"time low 2\" , \"time high 2\" ]) traveltime = traveltime . sort_values ( \"geometric mean low\" ) traveltime = traveltime . reset_index () traveltime . to_csv ( \"extraction.csv\" , index = False ) #save it to csv traveltime . head ( 10 ) #print the 10 first rows index latitude longitude geometric mean low geometric mean high time low 1 time high 1 time low 2 time high 2 0 38 48.833333 2.231111 16.124515 33.166248 10 20 26 55 1 89 48.888889 2.260000 16.248077 36.742346 22 45 12 30 2 97 48.900000 2.202222 16.733201 40.987803 35 70 8 24 3 6 48.800000 2.173333 16.733201 35.777088 8 16 35 80 4 98 48.900000 2.231111 17.320508 41.109610 30 65 10 26 5 99 48.900000 2.260000 17.663522 37.815341 26 55 12 26 6 88 48.888889 2.231111 17.663522 40.620192 26 55 12 30 7 7 48.800000 2.202222 17.748239 34.641016 9 16 35 75 8 68 48.866667 2.231111 18.000000 37.416574 18 35 18 40 9 87 48.888889 2.202222 18.330303 43.874822 28 55 12 35 Ok now that we finished preparing the data, it's time to draw some maps. We will use caropy in order to download some Google maps tiles. You might need to manually change the extent of the map. % matplotlib inline plt . rcParams [ 'figure.figsize' ] = 20 , 12 # Create a Stamen terrain background instance. stamen_terrain = cimgt . GoogleTiles () fig = plt . figure () # Create a GeoAxes in the tile's projection. ax = fig . add_subplot ( 1 , 1 , 1 , projection = stamen_terrain . crs ) # Limit the extent of the map to a small longitude/latitude range. ax . set_extent ([ lat_begin * 0.975 , lat_end * 1.02 , long_begin * 0.999 , long_end * 1.001 ], crs = ccrs . Geodetic ()) # Add the Stamen data at zoom level 10. ax . add_image ( stamen_terrain , 10 ) Now, we draw the 10 points that minimize time for user 1, color them is red and make the size of the pot proportionnal to the travel time of the second user. 
And we do the same for he 10 points that minimize time for user 2, color them is blue and make the size of the pot proportionnal to the travel time of the first user. for i , point in traveltime . sort_values ( \"time low 1\" ) . iterrows (): if i < 10 : ax . plot ( point . longitude , point . latitude , marker = 'o' , c = 'red' , markersize = point [ \"time low 2\" ], alpha = 0.5 , transform = ccrs . Geodetic ()) for i , point in traveltime . sort_values ( \"time low 2\" ) . iterrows (): if i < 10 : ax . plot ( point . longitude , point . latitude , marker = 'o' , c = 'blue' , markersize = point [ \"time low 1\" ], alpha = 0.5 , transform = ccrs . Geodetic ()) To help with the vizualisation, we add two stars on the maps in order to mark the location of the 2 destinations. # Add a marker for destination 1 ax . plot ( latitudeDestination1 , longitudeDestination1 , marker = '*' , c = 'green' , markersize = 25 , alpha = 1 , transform = ccrs . Geodetic ()) # Add a marker for destination 2 ax . plot ( latitudeDestination2 , longitudeDestination2 , marker = '*' , c = 'orange' , markersize = 25 , alpha = 1 , transform = ccrs . Geodetic ()) geodetic_transform = ccrs . Geodetic () . _as_mpl_transform ( ax ) text_transform = offset_copy ( geodetic_transform , units = 'dots' , x =- 25 ) Finally, we draw the maps. The optimal point is where both the dots in blue and in red are smaller. plt . show ()","tags":"Projects","url":"redoules.github.io/projects/get_traffic_data.html","loc":"redoules.github.io/projects/get_traffic_data.html"},{"title":"Get items in one dictionnary but not the other one","text":"Say we have two similar dictonnaries, we want to find all the items that are in the second dictonnary but not in the first one. dict1 = { \"Banana\" : \"yellow\" , \"Orange\" : \"orange\" } dict2 = { \"Banana\" : \"yellow\" , \"Orange\" : \"orange\" , \"Lemon\" : \"yellow\" } In the following code we find the difference of the keys and then rebuild a dict taking the corresponding values. difference = { k : dict2 [ k ] for k in set ( dict2 ) - set ( dict1 ) } difference {'Lemon': 'yellow'} Be careful, this operation is not symetric. This means that if we want to find a value present in dict1 but not in dict2 this code won't work dict1 = { \"Banana\" : \"yellow\" , \"Orange\" : \"orange\" , \"Lemon\" : \"yellow\" } dict2 = { \"Banana\" : \"yellow\" , \"Orange\" : \"orange\" } difference = { k : dict2 [ k ] for k in set ( dict2 ) - set ( dict1 ) } difference {}","tags":"Python","url":"redoules.github.io/python/compare_dict.html","loc":"redoules.github.io/python/compare_dict.html"},{"title":"Maximize a window in Windows","text":"A window can be maximized by using the win32 API. We will need to import win32gui and win32con import win32gui import win32con In order to maximize a windows, we need its handle number. In our example we will get the handle of the foreground window by using the GetForegroundWindow method. handle = win32gui . GetForegroundWindow () The window can then be maximized with the command win32gui . ShowWindow ( handle , win32con . SW_MAXIMIZE ) 24 minimized with the command win32gui . ShowWindow ( handle , win32con . SW_MINIMIZE ) 24 or restored to its original size with the command win32gui . ShowWindow ( handle , win32con . SW_NORMAL ) 24 The win32 API allow to hide or show a window. import time win32gui . ShowWindow ( handle , win32con . SW_HIDE ) #hide the window time . sleep ( 1 ) #keep it hidden for a second win32gui . ShowWindow ( handle , win32con . 
SW_SHOW ) #show the window 0","tags":"Python","url":"redoules.github.io/python/maximize_window.html","loc":"redoules.github.io/python/maximize_window.html"},{"title":"Iterate over a dictionnary","text":"To iterate over a dictionnary in a for loop, you have to use the following syntaxe for key , value in d . items (): Here is a examples printing the keys and values of a dictionnary d = { \"key1\" : \"value 1\" , \"key2\" : \"value 2\" , \"key3\" : \"value 3\" } for key , value in d . items (): print ( f ' { key } : { value } ' ) key1 : value 1 key2 : value 2 key3 : value 3","tags":"Python","url":"redoules.github.io/python/iterate_dict.html","loc":"redoules.github.io/python/iterate_dict.html"},{"title":"Case sensitive ConfigParser","text":"In order to have a case sensite ConfigParser, simply replace the ConfigParser with the following class : from configparser import ConfigParser class CaseConfigParser ( ConfigParser ): def optionxform ( self , optionstr ): return optionstr","tags":"Python","url":"redoules.github.io/python/case_config.html","loc":"redoules.github.io/python/case_config.html"},{"title":"Write a value to a config file","text":"You can store variables in an ini file for later executions of the script instead of hardcoding the values in the script. ConfigParser can take a pointer to a file an write values to that file using the ini syntaxe config = ConfigParser () config . read ( \"file.ini\" ) config . set ( \"my_section\" , \"my_param\" , \"my_value\" ) with open ( \"file.ini\" , \"w\" ) as f : config . write ( f ) Here is a function designed to help you update ini files def saveParam ( pathToFile , section , param , value ): \"\"\" Save/add value to an ini file pathToFile : string, path to the ini file section : string, name of the section to write to param : string, name of the parameter to write the value to value : value to be written to the parameter If the file doesn't exist, it will be created If the section doesn't exist, it will be created \"\"\" config = ConfigParser () if not os . path . isfile ( pathToFile ): #create the file if it does not exist with open ( pathToFile , \"w\" ) as f : pass config . read ( pathToFile ) if not config . has_section ( section ): #create the section if it does not exist config . add_section ( section ) config . set ( section , param , value ) with open ( pathToFile , \"w\" ) as f : #write value to the file config . 
write ( f )","tags":"Python","url":"redoules.github.io/python/write_config_file.html","loc":"redoules.github.io/python/write_config_file.html"},{"title":"Parula colormap for matplotlib","text":"from matplotlib.colors import LinearSegmentedColormap cm_data = [[ 0.2081 , 0.1663 , 0.5292 ], [ 0.2116238095 , 0.1897809524 , 0.5776761905 ], [ 0.212252381 , 0.2137714286 , 0.6269714286 ], [ 0.2081 , 0.2386 , 0.6770857143 ], [ 0.1959047619 , 0.2644571429 , 0.7279 ], [ 0.1707285714 , 0.2919380952 , 0.779247619 ], [ 0.1252714286 , 0.3242428571 , 0.8302714286 ], [ 0.0591333333 , 0.3598333333 , 0.8683333333 ], [ 0.0116952381 , 0.3875095238 , 0.8819571429 ], [ 0.0059571429 , 0.4086142857 , 0.8828428571 ], [ 0.0165142857 , 0.4266 , 0.8786333333 ], [ 0.032852381 , 0.4430428571 , 0.8719571429 ], [ 0.0498142857 , 0.4585714286 , 0.8640571429 ], [ 0.0629333333 , 0.4736904762 , 0.8554380952 ], [ 0.0722666667 , 0.4886666667 , 0.8467 ], [ 0.0779428571 , 0.5039857143 , 0.8383714286 ], [ 0.079347619 , 0.5200238095 , 0.8311809524 ], [ 0.0749428571 , 0.5375428571 , 0.8262714286 ], [ 0.0640571429 , 0.5569857143 , 0.8239571429 ], [ 0.0487714286 , 0.5772238095 , 0.8228285714 ], [ 0.0343428571 , 0.5965809524 , 0.819852381 ], [ 0.0265 , 0.6137 , 0.8135 ], [ 0.0238904762 , 0.6286619048 , 0.8037619048 ], [ 0.0230904762 , 0.6417857143 , 0.7912666667 ], [ 0.0227714286 , 0.6534857143 , 0.7767571429 ], [ 0.0266619048 , 0.6641952381 , 0.7607190476 ], [ 0.0383714286 , 0.6742714286 , 0.743552381 ], [ 0.0589714286 , 0.6837571429 , 0.7253857143 ], [ 0.0843 , 0.6928333333 , 0.7061666667 ], [ 0.1132952381 , 0.7015 , 0.6858571429 ], [ 0.1452714286 , 0.7097571429 , 0.6646285714 ], [ 0.1801333333 , 0.7176571429 , 0.6424333333 ], [ 0.2178285714 , 0.7250428571 , 0.6192619048 ], [ 0.2586428571 , 0.7317142857 , 0.5954285714 ], [ 0.3021714286 , 0.7376047619 , 0.5711857143 ], [ 0.3481666667 , 0.7424333333 , 0.5472666667 ], [ 0.3952571429 , 0.7459 , 0.5244428571 ], [ 0.4420095238 , 0.7480809524 , 0.5033142857 ], [ 0.4871238095 , 0.7490619048 , 0.4839761905 ], [ 0.5300285714 , 0.7491142857 , 0.4661142857 ], [ 0.5708571429 , 0.7485190476 , 0.4493904762 ], [ 0.609852381 , 0.7473142857 , 0.4336857143 ], [ 0.6473 , 0.7456 , 0.4188 ], [ 0.6834190476 , 0.7434761905 , 0.4044333333 ], [ 0.7184095238 , 0.7411333333 , 0.3904761905 ], [ 0.7524857143 , 0.7384 , 0.3768142857 ], [ 0.7858428571 , 0.7355666667 , 0.3632714286 ], [ 0.8185047619 , 0.7327333333 , 0.3497904762 ], [ 0.8506571429 , 0.7299 , 0.3360285714 ], [ 0.8824333333 , 0.7274333333 , 0.3217 ], [ 0.9139333333 , 0.7257857143 , 0.3062761905 ], [ 0.9449571429 , 0.7261142857 , 0.2886428571 ], [ 0.9738952381 , 0.7313952381 , 0.266647619 ], [ 0.9937714286 , 0.7454571429 , 0.240347619 ], [ 0.9990428571 , 0.7653142857 , 0.2164142857 ], [ 0.9955333333 , 0.7860571429 , 0.196652381 ], [ 0.988 , 0.8066 , 0.1793666667 ], [ 0.9788571429 , 0.8271428571 , 0.1633142857 ], [ 0.9697 , 0.8481380952 , 0.147452381 ], [ 0.9625857143 , 0.8705142857 , 0.1309 ], [ 0.9588714286 , 0.8949 , 0.1132428571 ], [ 0.9598238095 , 0.9218333333 , 0.0948380952 ], [ 0.9661 , 0.9514428571 , 0.0755333333 ], [ 0.9763 , 0.9831 , 0.0538 ]] parula_map = LinearSegmentedColormap . from_list ( 'parula' , cm_data ) # For use of \"viscm view\" test_cm = parula_map % matplotlib inline import matplotlib.pyplot as plt import numpy as np plt . rcParams [ 'figure.figsize' ] = 20 , 12 try : from viscm import viscm viscm ( parula_map ) except ImportError : print ( \"viscm not found, falling back on simple display\" ) plt . imshow ( np . 
linspace ( 0 , 100 , 256 )[ None , :], aspect = 'auto' , cmap = parula_map ) plt . show ()","tags":"Python","url":"redoules.github.io/python/Parula.html","loc":"redoules.github.io/python/Parula.html"},{"title":"List all opened windows on Windows","text":"You can use the function get_all_windows to get a dictonnary containing the titles of the opened windows as keys and the handles of those windows as values import win32gui def get_all_windows (): \"\"\" Returns dict with window desc and hwnd, \"\"\" def _MyCallback ( hwnd , extra ): hwnds , classes = extra hwnds . append ( hwnd ) classes [ win32gui . GetWindowText ( hwnd )] = hwnd windows = [] classes = {} win32gui . EnumWindows ( _MyCallback , ( windows , classes )) return classes get_all_windows () {'': 3802422, 'Forcepad driver tray window': 65676, 'Jauge de batterie': 131542, 'Network Flyout': 131650, 'Dashlane': 5570658, 'Wox': 131770, 'JupyterLab - Brave': 66990, 'python': 4261478, 'Visual Studio Code - Insiders': 329780, 'Code - Insiders': 526478, 'Documents': 526010, 'Windows PowerShell': 198580, 'Progression': 394934, 'Microsoft Edge': 131586, 'Microsoft Store': 197328, 'QTrayIconMessageWindow': 327816, 'Hidden Window': 459506, '.NET-BroadcastEventWindow.4.0.0.0.3e2c690.0': 131824, 'SystemResourceNotifyWindow': 197346, 'MediaContextNotificationWindow': 197344, 'Resilio Sync 2.6.3': 262934, } print ( \"List of all opened windows : \" ) for key , value in get_all_windows () . items (): if key != \"\" : print ( \" \\t * \" + key . split ( \" \\n \" )[ 0 ]) List of all opened windows : * Forcepad driver tray window * Jauge de batterie * Network Flyout * Dashlane * Wox * JupyterLab - Brave * python * Visual Studio Code - Insiders * Code - Insiders * Documents * Windows PowerShell * Progression * Microsoft Edge * Microsoft Store * QTrayIconMessageWindow * Hidden Window * .NET-BroadcastEventWindow.4.0.0.0.3e2c690.0 * SystemResourceNotifyWindow * MediaContextNotificationWindow * Resilio Sync 2.6.3","tags":"Python","url":"redoules.github.io/python/list_windows.html","loc":"redoules.github.io/python/list_windows.html"},{"title":"Reverse column order in pandas","text":"# Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 The columns of a dataframe can be reversed by using the loc accessor and passing :,::-1 . The : before the , means select all rows and the ::-1 after the , means reverse the column order df . loc [:,:: - 1 ] kcal color fruit 0 89 yellow Banana 1 47 orange Orange 2 52 red Apple 3 15 yellow lemon 4 30 green lime 5 28 purple plum","tags":"Python","url":"redoules.github.io/python/reverse_column_order.html","loc":"redoules.github.io/python/reverse_column_order.html"},{"title":"Reverse row order in pandas","text":"# Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . 
DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 The rows of a dataframe can be reversed by using the loc accessor and passing ::-1 . This notation is the same as the one used to reverse a list in python df . loc [:: - 1 ] fruit color kcal 5 plum purple 28 4 lime green 30 3 lemon yellow 15 2 Apple red 52 1 Orange orange 47 0 Banana yellow 89 If you want to reset the index as well so that the dataframe starts with a 0, you can combine what we just learned with the reset_index method df . loc [:: - 1 ] . reset_index ( drop = True ) fruit color kcal 0 plum purple 28 1 lime green 30 2 lemon yellow 15 3 Apple red 52 4 Orange orange 47 5 Banana yellow 89 that way, the rows are in reverse order but the index column has been re-initialized so it starts with 0","tags":"Python","url":"redoules.github.io/python/reverse_row_order.html","loc":"redoules.github.io/python/reverse_row_order.html"},{"title":"Rename columns in pandas","text":"In this article, we will be renaming columns in a pandas dataframe. First, let's import pandas and create an example dataframe # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 The most flexible method for renaming columns in pandas is the rename method. It takes a dictornnary as an argument where : * the keys are the old names * the values are the new names you also need to specify the axis. This method can be used to rename either one or multiple columns df = df . rename ({ \"fruit\" : \"produce\" , \"kcal\" : \"energy\" }, axis = \"columns\" ) df produce color energy 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If you want to rename all the columns at once, a common method is to rewrite the columns attribute of the dataframe df . columns = [ \"nice fruit\" , \"bright color\" , \"light kcal\" ] df nice fruit bright color light kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If the only thing you are doing is replacing a space with an underscore, an even better method is to use the str.replace method since you don't have to type all the column names df . columns = df . columns . str . replace ( \" \" , \"_\" ) df nice_fruit bright_color light_kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Similarly, you can use other str methods such as : * capitalize : in order to converts first character to capital letter * lower : in order to have lowercase column names * upper : in order to have uppercase column names * etc. df . columns = df . columns . str . capitalize () df Nice_fruit Bright_color Light_kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Finaly, if you only need to add a prefix or a suffix to the columns, you can use the add_prefix method df . 
add_prefix ( \"pre_\" ) pre_Nice_fruit pre_Bright_color pre_Light_kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 or the add_suffix method df . add_suffix ( \"_post\" ) Nice_fruit_post Bright_color_post Light_kcal_post 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28","tags":"Python","url":"redoules.github.io/python/rename_columns_pandas.html","loc":"redoules.github.io/python/rename_columns_pandas.html"},{"title":"Update a value in a TABLE","text":"In this very simple example we will see how to update a row in a sql database Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' The content of the TABLE is the following % sql SELECT * FROM tutyfrutty * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 We want to update the kcal value for \"Orange\". We will use the UPDATE keyword in order to do so. We can use the index value in order to access the row and set a new value. % sql UPDATE \"tutyfrutty\" SET \"kcal\" = 48 WHERE \"index\" = 1 ; % sql SELECT * FROM tutyfrutty WHERE \"index\" = 1 ; * sqlite:///mydatabase.db 1 rows affected. * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 48 The WHERE keyword can be used to specify mutilple rows. For instance, if we want to change the kcal value of all yellow fruits : % sql UPDATE \"tutyfrutty\" SET \"kcal\" = 126 WHERE \"color\" = \"yellow\" ; % sql SELECT * FROM tutyfrutty * sqlite:///mydatabase.db 2 rows affected. * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 126 1 Orange orange 48 2 Apple red 52 3 lemon yellow 126 4 lime green 30 5 plum purple 28 7 Cranberry red 308","tags":"SQL","url":"redoules.github.io/sql/update_table.html","loc":"redoules.github.io/sql/update_table.html"},{"title":"Parse variable from config file","text":"You can store variables in an ini file for later executions of the script instead of hardcoding the values in the script. Here is the content of config.ini : [section1] var_a:hello var_b:world [section2] myvariable: 42 Use the configparser library for an easy access and parsing of the file. import configparser config = configparser . ConfigParser () config . read ( \"config.ini\" ) var_a = config . get ( \"section1\" , \"var_a\" ) var_b = config . get ( \"section1\" , \"var_b\" ) myvariable = config . get ( \"section2\" , \"myvariable\" ) print ( var_a , var_b ) print ( myvariable ) hello world 42","tags":"Python","url":"redoules.github.io/python/config_parse.html","loc":"redoules.github.io/python/config_parse.html"},{"title":"Granting access to a database to a remote client by IP","text":"The MariaDB users are granted access based on where the connection is coming from. By default the access is restricted to: * 127.0.0.1 * localhost The program you are using to connect is not indentifying itself as 127.0.0.1 or localhost. You will have to verify the IP access it's being identified as, then add that to your grant table. 
In order to add grant an IP address you can use the following commands : GRANT ALL ON *.* TO 'user'@'computer.host.com'; GRANT ALL ON *.* TO 'user'@'192.168.1.6'; GRANT ALL ON *.* TO 'user'@'%'; The % is a wildcard that means any IP. Be careful when using it especially on the root user. If you want to remove the rights you granted, you can use the command REVOKE ALL PRIVILEGES ON *.* TO 'user'@'computer.host.com'; REVOKE ALL PRIVILEGES ON *.* TO 'user'@'192.168.1.6'; REVOKE ALL PRIVILEGES ON *.* FROM 'user'@'%'","tags":"SQL","url":"redoules.github.io/sql/grant_ip.html","loc":"redoules.github.io/sql/grant_ip.html"},{"title":"Using pandas extended types","text":"Historically, pandas is bound to numpy arrays and its limitations : * integer and bool types cannot store missing data, indeed, np.nan is of type np.float * non-numeric types such as categorical, datetime are not natively supported. Indeed, internally, pandas relies on numpy arrays to efficiently store and perform operations on the data. With the recent versions of pandas, it is possible to define custom types. In order to avoid having to extensively update the internal pandas code with each new extensions, it is now possible to define : * an extension type which describes the data type and can be registered for creation via a string (i.e. ...astype(\"example\") ) * an extension array which is a class that handles the datatype. There is no real restriction on its construction though it must be convertible to a numpy array in order to make it work with the functions implemented in pandas. It is limited to one dimension. Some implementations are already implemented in pandas. For example, IntegerNA is a pandas extension that can handle missing integer data without casting to float (and loosing the precision). In order to use IntergerNA, you need to specify the type as \"Int64\" with a capital \"i\" import pandas as pd import numpy as np s = pd . Series ([ 1 , np . nan , 2 ]) print ( s ) s_ext = pd . Series ([ 1 , np . nan , 2 ], dtype = \"Int64\" ) print ( s_ext ) 0 1.0 1 NaN 2 2.0 dtype: float64 0 1.0 1 NaN 2 2.0 dtype: Int64 It is possible to leverage this extension to avoid running out of memory. For instance, we can store data using the \"UInt16\" type, this will avoid having to cast to float64. Other uses might be handling special data types such as : * ip adresses (see cyberpandas) * gps locations * etc... sources : * https://pandas.pydata.org/pandas-docs/stable/extending.html * PyData LA 2018 Extending Pandas with Custom Types - Will Ayd","tags":"Python","url":"redoules.github.io/python/pandas_extended_types.html","loc":"redoules.github.io/python/pandas_extended_types.html"},{"title":"Basic operations with SED","text":"In this article we will learn about some of the main uses we can use sed for : * replacing * deleting * printing For this example we will learn how to remove the comments starting with the '#' sign and the blank lines for the following file : ## Header of input . csv # this file contains information I want to parse with a simple program . # The header , the footer or any comment starting with a \"#\" will be removed # The blank lines will also be removed # img , processed , defaut # bloc 1 0 , a0000 . tif ,, 1 , a0001 . tif , True , \"(139, 63)(145, 91)\" 2 , a0002 . tif , True , \"(93, 72)(24, 162)(31, 64)\" 3 , a0003 . tif ,, 4 , a0004 . tif ,, 5 , a0005 . tif ,, 6 , a0006 . tif ,, 7 , a0007 . tif ,, 8 , a0008 . tif ,, 9 , a0009 . tif , True , \"(127, 80)(104, 60)(87, 63)(53, 78)(17, 126)\" 10 , a0010 . 
tif ,, 11 , a0011 . tif , True , \"(39, 78)(84, 110)\" # a random comment passing by # end of bloc 1 # bloc 2 12 , a0012 . tif ,, 13 , a0013 . tif ,, 14 , a0014 . tif ,, 15 , a0015 . tif , True , \"(146, 65)(146, 89)(139, 146)(16, 68)\" 16 , a0016 . tif , True , \"(51, 59)(77, 69)(145, 78)(139, 112)(97, 123)(17, 148)\" 17 , a0017 . tif ,, # end of bloc 2 # bloc 3 18 , a0018 . tif ,, 19 , a0019 . tif ,, 20 , a0020 . tif , True , \"(57, 99)(12, 113)(27, 139)(16, 158)\" 21 , a0021 . tif ,, 22 , a0022 . tif ,, 23 , a0023 . tif ,, 24 , a0024 . tif ,, 25 , a0025 . tif ,, 26 , a0026 . tif ,, # end of bloc 3 27 , a0027 . tif , True , \"(11, 86)(29, 74)(92, 68)(109, 129)(132, 104)\" 28 , a0028 . tif ,, 29 , a0029 . tif , True , \"(128, 58)\" 30 , a0030 . tif , True , \"(133, 59)(99, 77)(111, 100)(115, 153)\" 31 , a0031 . tif , True , \"(43, 154)(27, 177)\" ## footer : end of file Anatomy of a SED command If we run the command : sed \"\" input.csv Everything inside the brackets will be interpreted as a sed command. In our case, there is nothing hence the file will be printed to the consol without any modification. You can put in the quotation mark one of sed's many commands for instance s that stands for substitude and is one of the most commonly used. In our csv we use the comma separator, let's say that we want to change it to a semicolon. We would do : sed \"s/,/;/\" input.csv so we have s meaning that we want to use the replace command followed by a / and the caracter(s) we want to replace followed by a / and the caracter(s) we want to replace it with and finally a / . The result is the following : ## Header # this file contains information I want to parse with a simple program . # The header ; the footer or any comment starting with a \"#\" will be removed # The blank lines will also be removed # img ; processed , defaut # bloc 1 0 ; a0000 . tif ,, 1 ; a0001 . tif , True , \"(139, 63)(145, 91)\" 2 ; a0002 . tif , True , \"(93, 72)(24, 162)(31, 64)\" 3 ; a0003 . tif ,, 4 ; a0004 . tif ,, 5 ; a0005 . tif ,, 6 ; a0006 . tif ,, 7 ; a0007 . tif ,, 8 ; a0008 . tif ,, 9 ; a0009 . tif , True , \"(127, 80)(104, 60)(87, 63)(53, 78)(17, 126)\" 10 ; a0010 . tif ,, 11 ; a0011 . tif , True , \"(39, 78)(84, 110)\" # a random comment passing by # end of bloc 1 # bloc 2 12 ; a0012 . tif ,, 13 ; a0013 . tif ,, 14 ; a0014 . tif ,, 15 ; a0015 . tif , True , \"(146, 65)(146, 89)(139, 146)(16, 68)\" 16 ; a0016 . tif , True , \"(51, 59)(77, 69)(145, 78)(139, 112)(97, 123)(17, 148)\" 17 ; a0017 . tif ,, # end of bloc 2 # bloc 3 18 ; a0018 . tif ,, 19 ; a0019 . tif ,, 20 ; a0020 . tif , True , \"(57, 99)(12, 113)(27, 139)(16, 158)\" 21 ; a0021 . tif ,, 22 ; a0022 . tif ,, 23 ; a0023 . tif ,, 24 ; a0024 . tif ,, 25 ; a0025 . tif ,, 26 ; a0026 . tif ,, # end of bloc 3 27 ; a0027 . tif , True , \"(11, 86)(29, 74)(92, 68)(109, 129)(132, 104)\" 28 ; a0028 . tif ,, 29 ; a0029 . tif , True , \"(128, 58)\" 30 ; a0030 . tif , True , \"(133, 59)(99, 77)(111, 100)(115, 153)\" 31 ; a0031 . tif , True , \"(43, 154)(27, 177)\" ## footer : end of file As you can see, only the first comma has been replaced, in order to repeat the command multiple times per line we need to specify the g option sed \"s/,/;/g\" input.csv ## Header # this file contains information I want to parse with a simple program . # The header ; the footer or any comment starting with a \"#\" will be removed # The blank lines will also be removed # img ; processed ; defaut # bloc 1 0 ; a0000 . tif ;; 1 ; a0001 . tif ; True ; \"(139; 63)(145; 91)\" 2 ; a0002 . 
tif ; True ; \"(93; 72)(24; 162)(31; 64)\" 3 ; a0003 . tif ;; 4 ; a0004 . tif ;; 5 ; a0005 . tif ;; 6 ; a0006 . tif ;; 7 ; a0007 . tif ;; 8 ; a0008 . tif ;; 9 ; a0009 . tif ; True ; \"(127; 80)(104; 60)(87; 63)(53; 78)(17; 126)\" 10 ; a0010 . tif ;; 11 ; a0011 . tif ; True ; \"(39; 78)(84; 110)\" # a random comment passing by # end of bloc 1 # bloc 2 12 ; a0012 . tif ;; 13 ; a0013 . tif ;; 14 ; a0014 . tif ;; 15 ; a0015 . tif ; True ; \"(146; 65)(146; 89)(139; 146)(16; 68)\" 16 ; a0016 . tif ; True ; \"(51; 59)(77; 69)(145; 78)(139; 112)(97; 123)(17; 148)\" 17 ; a0017 . tif ;; # end of bloc 2 # bloc 3 18 ; a0018 . tif ;; 19 ; a0019 . tif ;; 20 ; a0020 . tif ; True ; \"(57; 99)(12; 113)(27; 139)(16; 158)\" 21 ; a0021 . tif ;; 22 ; a0022 . tif ;; 23 ; a0023 . tif ;; 24 ; a0024 . tif ;; 25 ; a0025 . tif ;; 26 ; a0026 . tif ;; # end of bloc 3 27 ; a0027 . tif ; True ; \"(11; 86)(29; 74)(92; 68)(109; 129)(132; 104)\" 28 ; a0028 . tif ;; 29 ; a0029 . tif ; True ; \"(128; 58)\" 30 ; a0030 . tif ; True ; \"(133; 59)(99; 77)(111; 100)(115; 153)\" 31 ; a0031 . tif ; True ; \"(43; 154)(27; 177)\" ## footer : end of file Removing comments In order to remove comments, we can replace the pattern of a comment by nothing. A commant starts with a # sign and is followed by an arbitrarly long chain of characters. In order to match this pattern, we will use a regular expression. sed \"s/#.*//g\" input.csv #.* means : find strings that start with a # , the . stands for any character, finally, the * means that the . can be repeated any number of times. That means that sed will look for a string starting with a # followed by any characters that come after in the line. If we want to do something a bit cleaner, we can try to remove any whitespace before the comments as well. In order to do so, we will use the \\s regular expression that represents a whitespace. If we want to make sure that we removed any whitespace before the comments, we will do \\s* The final regular expression is then \\s*#.* and it will then be replaced by nothing. Finally, the command becomes sed \"s/\\s*#.*//g\" input.csv If we run that command, all our comments have disappeared 0 , a0000 . tif ,, 1 , a0001 . tif , True , \"(139, 63)(145, 91)\" 2 , a0002 . tif , True , \"(93, 72)(24, 162)(31, 64)\" 3 , a0003 . tif ,, 4 , a0004 . tif ,, 5 , a0005 . tif ,, 6 , a0006 . tif ,, 7 , a0007 . tif ,, 8 , a0008 . tif ,, 9 , a0009 . tif , True , \"(127, 80)(104, 60)(87, 63)(53, 78)(17, 126)\" 10 , a0010 . tif ,, 11 , a0011 . tif , True , \"(39, 78)(84, 110)\" 12 , a0012 . tif ,, 13 , a0013 . tif ,, 14 , a0014 . tif ,, 15 , a0015 . tif , True , \"(146, 65)(146, 89)(139, 146)(16, 68)\" 16 , a0016 . tif , True , \"(51, 59)(77, 69)(145, 78)(139, 112)(97, 123)(17, 148)\" 17 , a0017 . tif ,, 18 , a0018 . tif ,, 19 , a0019 . tif ,, 20 , a0020 . tif , True , \"(57, 99)(12, 113)(27, 139)(16, 158)\" 21 , a0021 . tif ,, 22 , a0022 . tif ,, 23 , a0023 . tif ,, 24 , a0024 . tif ,, 25 , a0025 . tif ,, 26 , a0026 . tif ,, 27 , a0027 . tif , True , \"(11, 86)(29, 74)(92, 68)(109, 129)(132, 104)\" 28 , a0028 . tif ,, 29 , a0029 . tif , True , \"(128, 58)\" 30 , a0030 . tif , True , \"(133, 59)(99, 77)(111, 100)(115, 153)\" 31 , a0031 . tif , True , \"(43, 154)(27, 177)\" Removing blank lines In order to remove the blank lines, we need to specify to sed a pattern corresponding to a blank line and use the d command where the d stands for delete. The delete command expects the pattern to the between / Now let's define what pattern a blank line corresponds to. 
Since there is no symbol for blankness, we can do the following ^$ . ^ means the begining of a line and $ corresponds to the end of a line. So whenever we find a blank line we delete it. sed \"/^ $ / d\" input.csv ## Header #this file contains information I want to parse with a simple program. #The header, the footer or any comment starting with a \"#\" will be removed #The blank lines will also be removed #img,processed,defaut #bloc 1 0,a0000.tif,, 1,a0001.tif,True,\"(139, 63)(145, 91)\" 2,a0002.tif,True,\"(93, 72)(24, 162)(31, 64)\" 3,a0003.tif,, 4,a0004.tif,, 5,a0005.tif,, 6,a0006.tif,, 7,a0007.tif,, 8,a0008.tif,, 9,a0009.tif,True,\"(127, 80)(104, 60)(87, 63)(53, 78)(17, 126)\" 10,a0010.tif,, 11,a0011.tif,True,\"(39, 78)(84, 110)\" # a random comment passing by #end of bloc 1 #bloc 2 12,a0012.tif,, 13,a0013.tif,, 14,a0014.tif,, 15,a0015.tif,True,\"(146, 65)(146, 89)(139, 146)(16, 68)\" 16,a0016.tif,True,\"(51, 59)(77, 69)(145, 78)(139, 112)(97, 123)(17, 148)\" 17,a0017.tif,, #end of bloc 2 #bloc 3 18,a0018.tif,, 19,a0019.tif,, 20,a0020.tif,True,\"(57, 99)(12, 113)(27, 139)(16, 158)\" 21,a0021.tif,, 22,a0022.tif,, 23,a0023.tif,, 24,a0024.tif,, 25,a0025.tif,, 26,a0026.tif,, #end of bloc 3 27,a0027.tif,True,\"(11, 86)(29, 74)(92, 68)(109, 129)(132, 104)\" 28,a0028.tif,, 29,a0029.tif,True,\"(128, 58)\" 30,a0030.tif,True,\"(133, 59)(99, 77)(111, 100)(115, 153)\" 31,a0031.tif,True,\"(43, 154)(27, 177)\" ## footer : end of file combining the sed commands We can concatenate sed commands by separating them with a semicolon. Hence the final sed command will be : sed \"s/\\s*#.*//g;/^ $ / d\" input.csv and the final output is 0,a0000.tif,, 1,a0001.tif,True,\"(139, 63)(145, 91)\" 2,a0002.tif,True,\"(93, 72)(24, 162)(31, 64)\" 3,a0003.tif,, 4,a0004.tif,, 5,a0005.tif,, 6,a0006.tif,, 7,a0007.tif,, 8,a0008.tif,, 9,a0009.tif,True,\"(127, 80)(104, 60)(87, 63)(53, 78)(17, 126)\" 10,a0010.tif,, 11,a0011.tif,True,\"(39, 78)(84, 110)\" 12,a0012.tif,, 13,a0013.tif,, 14,a0014.tif,, 15,a0015.tif,True,\"(146, 65)(146, 89)(139, 146)(16, 68)\" 16,a0016.tif,True,\"(51, 59)(77, 69)(145, 78)(139, 112)(97, 123)(17, 148)\" 17,a0017.tif,, 18,a0018.tif,, 19,a0019.tif,, 20,a0020.tif,True,\"(57, 99)(12, 113)(27, 139)(16, 158)\" 21,a0021.tif,, 22,a0022.tif,, 23,a0023.tif,, 24,a0024.tif,, 25,a0025.tif,, 26,a0026.tif,, 27,a0027.tif,True,\"(11, 86)(29, 74)(92, 68)(109, 129)(132, 104)\" 28,a0028.tif,, 29,a0029.tif,True,\"(128, 58)\" 30,a0030.tif,True,\"(133, 59)(99, 77)(111, 100)(115, 153)\" 31,a0031.tif,True,\"(43, 154)(27, 177)\" Overwritting the file If we want to overwrite the result of sed on the input file, we need to add the option -i . So if you are running sed without -i , it is safe and won't alter your files. 
sed -i \"s/\\s*#.*//g;/^ $ / d\" input.csv","tags":"Linux","url":"redoules.github.io/linux/sed.html","loc":"redoules.github.io/linux/sed.html"},{"title":"How to create a ramdisk?","text":"Writing to a SSD Let's first write 500Mo to a ssd with the command : dd if = /dev/zero of = test.iso bs = 1M count = 500 On my machine with a ssd I get the following results 500+0 records in 500+0 records out 524288000 bytes (524 MB, 500 MiB) copied, 10,153 s, 51,6 MB/s Writing to a ramdisk Let's now create a temporary ram disk and see how faster it is mkdir /mnt/ram mount -t tmpfs tmpfs /mnt/ram -o size = 600M cd /mnt/ram dd if = /dev/zero of = test.iso bs = 1M count = 500 The results yield a faster write time : 500+0 records in 500+0 records out 524288000 bytes (524 MB, 500 MiB) copied, 6,39849 s, 81,9 MB/s on a machine with linux installed natively the results shoud be even faster.","tags":"Linux","url":"redoules.github.io/linux/ramdisk.html","loc":"redoules.github.io/linux/ramdisk.html"},{"title":"Kullback Leibler Divergence","text":"The Kullback Leibler Divergence also famously called KL Divergence. The KL divergence actually measures the difference between any two distributions. The KL divergence is defined as : $$D_{KL}(p||q)=\\int_{X} p(x)log\\frac{p(x)}{q(x)} dx$$ And it is always non-negative and zero only when P is equal to Q. Indeed, when P is equal to Q, the log of 1 is zero, and that's why the distance is zero. Otherwise, it is always some non-negative quantity. It is can be view as a distance measure but in reality it is not because it isn't a symertic operator and because it doesn't follow the triangle law. $$D_{KL}(p||q)\\neq D_{KL}(q||p)$$ Where to use the KL-Divergence In supervised learning you are always trying to model our data to a particular distribution. So in that case our \\(P\\) can be the unknown distribution. We usually want to build an estimated probability distribution \\(Q\\) based on the sample samples \\(X\\) . When the estimator is perfect, \\(P\\) and \\(Q\\) are the same hence \\(D_{KL}(p||q) = 0\\) . This means that the KL-Divergence can be used as a mesure of the error. Jonathon Shlens explains that the KL-Divergence can be interpreted as measuring the likelihood that samples form an empirical distribution \\(p\\) were generated by a fixed distribution \\(q\\) . $$D_{KL}(p||q)=\\int_{X} p(x)log\\frac{p(x)}{q(x)}dx$$ $$D_{KL}(p||q)=\\int_{X}\\left( -p(x)log q(x) +p(x) log p(x) \\right) dx$$ The entropy of p is defined as \\(H(p)=-\\int_{X}p(x) log(p(x)) dx\\) The cross entropy between p and q is defined as $H(p,q)=-\\int_{X} p(x)log q(x) $ Hence: $$D_{KL}(p||q)=H(p,q) - H(p)$$ In many machine learning algorithm, in particular deep learning, the optimization problem revolves around minimizing the cross-entropy. Taking into account what we learnt above : $$H(p,q) = H(p) + D_{KL}(p||q) $$ The cross entropy between two probability distributions \\(p\\) and \\(q\\) over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an \"artificial\" probability distribution \\(q\\) , rather than the \"true\" distribution \\(p\\) . 
This means that, the cross entropy corresponds to the number of bits needed to encode the distribution \\(p\\) given by \\(H(p)\\) the entropy of \\(p\\) plus the number of bits needed to encode the directed divergence between the two distributions \\(p\\) and \\(q\\) given by \\(D_{KL}(p||q)\\) , the Kullback Leibler Divergence","tags":"Mathematics","url":"redoules.github.io/mathematics/KLD.html","loc":"redoules.github.io/mathematics/KLD.html"},{"title":"Downloading from quandl","text":"In this example we will download the price of bitcoin from quandl import quandl btc_price_data = quandl . get ( \"BCHARTS/COINBASEEUR\" ) btc_price_data . head () Open High Low Close Volume (BTC) Volume (Currency) Weighted Price Date 2015-05-11 215.85 219.15 214.67 217.83 145.863137 31656.416712 217.028218 2015-05-12 218.04 218.50 214.16 215.50 127.225520 27467.730961 215.897966 2015-05-13 216.43 217.45 208.53 208.88 111.808285 24014.796862 214.785486 2015-05-14 209.23 209.83 204.94 207.93 148.228400 30825.088327 207.956696 2015-05-15 207.95 209.51 207.62 208.55 127.718800 26586.847126 208.167060","tags":"Python","url":"redoules.github.io/python/quandl.html","loc":"redoules.github.io/python/quandl.html"},{"title":"Computing the Mayer multiple","text":"I've learnt about the Mayer mutliple from The Inverstor Podcast . The Mayer multiple is the ratio of the bitcoin price divided by the 200-day moving average. It is designed to understand the price of bitcoin without taking in account the short term volatility. It helps investors filter out their emotions during a bull run. Let's see how to compute the Mayer mutliple in python. First, we need to import the data, we will use Quandl to download data from coinbase import quandl btc_price_data = quandl . get ( \"BCHARTS/COINBASEEUR\" ) btc_price_data . tail () Open High Low Close Volume (BTC) Volume (Currency) Weighted Price Date 2018-12-12 2966.00 3076.71 2952.05 3026.00 1447.627465 4.372890e+06 3020.728514 2018-12-13 3025.19 3028.06 2861.15 2886.91 2125.242928 6.261750e+06 2946.369017 2018-12-14 2886.91 2919.00 2800.32 2835.50 2527.558347 7.256959e+06 2871.134083 2018-12-15 2835.49 2865.00 2781.47 2830.45 1267.004758 3.568614e+06 2816.575409 2018-12-16 2830.45 2830.45 2830.44 2830.45 0.144249 4.082886e+02 2830.447385 Next we need to compute the 200 days moving average of the price of bitcoin moving_averages = btc_price_data [[ \"Open\" , \"High\" , \"Low\" , \"Close\" ]] . rolling ( window = 200 ) . mean () moving_averages . tail () Open High Low Close Date 2018-12-12 5507.88295 5611.72570 5380.60150 5491.43610 2018-12-13 5491.44155 5595.14440 5363.77840 5474.38560 2018-12-14 5474.39910 5577.89585 5347.27500 5457.99125 2018-12-15 5457.94930 5559.49145 5330.78235 5439.75570 2018-12-16 5439.71660 5540.89425 5313.62955 5422.17700 Finally, we can compute the ratio and plot it. % matplotlib inline import matplotlib.pyplot as plt plt . rcParams [ 'savefig.dpi' ] = 300 plt . rcParams [ 'figure.dpi' ] = 163 plt . rcParams [ 'figure.autolayout' ] = False plt . rcParams [ 'figure.figsize' ] = 20 , 12 plt . rcParams [ 'font.size' ] = 26 mayer_multiple = btc_price_data / moving_averages mayer_multiple [ \"High\" ] . plot () plt . title ( \"Mayer Mutliple over time\" ) plt . ylabel ( \"Mayer Mutliple\" ) plt . xlabel ( \"Time\" ) print ( f \"Mayer multiple { mayer_multiple . iloc [ - 1 ][ 'High' ] } \" ) print ( f \"Mayer multiple average { mayer_multiple . 
mean ()[ 'High' ] } \" ) Mayer multiple 0.5108290958630005 Mayer multiple average 1.3789102045356179 Lastly, I wanted to plot the distribution of the Mayer multiple import numpy as np x = mayer_multiple [ \"High\" ] . values x = x [ ~ np . isnan ( x )] n , a , patches = plt . hist ( x , 100 , facecolor = 'green' , alpha = 0.75 , density = True ) plt . axvline ( x = 2.4 , color = \"red\" ) plt . annotate ( 'We are here today' , xy = ( mayer_multiple . iloc [ - 1 ][ \"High\" ], n [( np . abs ( bins - mayer_multiple . iloc [ - 1 ][ \"High\" ])) . argmin ()]), xytext = ( mayer_multiple . iloc [ - 1 ][ \"High\" ] * 3 , n . max () / 2 ), arrowprops = dict ( facecolor = 'black' , shrink = 0.05 ), ) plt . title ( \"Distribution of the Mayer mutliple\" ) plt . plot () []","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/mayer_multiple.html","loc":"redoules.github.io/cryptocurrencies/mayer_multiple.html"},{"title":"Load a saved model","text":"A keras model can be loaded from a hdf5 file. Becareful, a keras generated hdf5 can contain either : the model weights (obtained by the .save_weights() method) the model weights and the model architecture (obtained by the .save() method) In our case, we will load both the model and its weights with the load_model function from the keras.models module from keras.models import load_model model = load_model ( \"my_model.h5\" ) type ( model ) keras.engine.sequential.Sequential","tags":"DL","url":"redoules.github.io/dl/keras_load.html","loc":"redoules.github.io/dl/keras_load.html"},{"title":"Get model input shape","text":"Information on the model can be accessed by the .input_shape parameter of the model object from keras.models import load_model model = load_model ( \"my_model.h5\" ) model . input_shape (None, 28, 28)","tags":"DL","url":"redoules.github.io/dl/keras_input.html","loc":"redoules.github.io/dl/keras_input.html"},{"title":"Get model info and number of parameters","text":"Information on the model can be accessed by the .summary() method from keras.models import load_model model = load_model ( \"my_model.h5\" ) model . summary () _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= flatten_2 (Flatten) (None, 784) 0 _________________________________________________________________ dense_3 (Dense) (None, 128) 100480 _________________________________________________________________ dense_4 (Dense) (None, 10) 1290 ================================================================= Total params: 101,770 Trainable params: 101,770 Non-trainable params: 0 _________________________________________________________________","tags":"DL","url":"redoules.github.io/dl/keras_info.html","loc":"redoules.github.io/dl/keras_info.html"},{"title":"Saving the models weights after each epoch","text":"Let's see how we can save the model weights after every epoch. Let's first import some libraries import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting in 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Let's define a simple feedforward network. 
##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"acc\" ]) In order to automatically save the model weights to a hdf5 format after every epoch we need to import the ModelCheckpoint callback located in keras . callbacks from keras.callbacks import ModelCheckpoint We now need to define the ModelCheckPoint callback, it takes 7 arguments : filepath: string, path to save the model file. monitor: quantity to monitor verbose: verbosity mode, 0 or 1 save_best_only: if save_best_only=True , the latest best model according to the quantity monitored will not be overwritten. mode: one of {auto, min, max} If save_best_only=True , the decision to overwrite the current save file is made based on either the maximization or the minimization of the monitored quantity. save_weights_only: if True, then only the model's weights will be saved ( model.save_weights(filepath) ), else the full model is saved ( model.save(filepath) ). period: Interval (number of epochs) between checkpoints. The callback has to be added to the callbacks list in the fit method. save_to_hdf5 = ModelCheckpoint ( filepath = \"my_model.h5\" , monitor = 'acc' , verbose = 0 , save_best_only = True , save_weights_only = False , mode = 'auto' , period = 1 ) model . fit ( train_images , train_labels , epochs = 5 , callbacks = [ save_to_hdf5 ]) Epoch 1 / 5 60000 / 60000 [ ============================== ] - 29 s 478 us / step - loss : 0 . 3975 - acc : 0 . 8575 Epoch 2 / 5 60000 / 60000 [ ============================== ] - 98 s 2 ms / step - loss : 0 . 3498 - acc : 0 . 8721 Epoch 3 / 5 60000 / 60000 [ ============================== ] - 95 s 2 ms / step - loss : 0 . 3213 - acc : 0 . 8825 Epoch 4 / 5 60000 / 60000 [ ============================== ] - 64 s 1 ms / step - loss : 0 . 3021 - acc : 0 . 8887 : 6 s - loss : Epoch 5 / 5 60000 / 60000 [ ============================== ] - 61 s 1 ms / step - loss : 0 . 2855 - acc : 0 . 8953 : 4 s - loss : 0 . 284 - ETA : 3 s - loss : - ETA : < keras . callbacks . History at 0 x1d70aa62f28 > import os if \"my_model.h5\" in os . listdir () : print ( \"The model is saved to my_model.h5\" ) else : print ( 'no model saved to disk' ) The model is saved to my_model.h5","tags":"DL","url":"redoules.github.io/dl/model_checkpoint_keras.html","loc":"redoules.github.io/dl/model_checkpoint_keras.html"},{"title":"Logging the training progress in a CSV","text":"Let's see how we can log the progress and various metrics during the training process to a csv file. Let's first import some libraries import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. In order to import the data we will be using the built in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting in 2 fully connected layers. 
The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Let's define a simple feedforward network. ##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" , 'mae' ]) In order to stream to a csv file the epoch results and metrics, we define a CSV logger. It is a callback located in keras . callbacks Let's first import it from keras.callbacks import CSVLogger We now need to define the callback by specifiying a file to be written to, the separator and whether to append to the file or erase it every time. The callback has to be added to the callbacks list in the fit method. csv_logger = CSVLogger ( filename = \"my_csv.csv\" , separator = ';' , append = False ) model . fit ( train_images , train_labels , epochs = 5 , callbacks = [ csv_logger ]) Epoch 1 / 5 60000 / 60000 [ ============================== ] - 9 s 148 us / step - loss : 0 . 5020 - acc : 0 . 8234 - mean_absolute_error : 4 . 4200 Epoch 2 / 5 60000 / 60000 [ ============================== ] - 8 s 138 us / step - loss : 0 . 3765 - acc : 0 . 8630 - mean_absolute_error : 4 . 4200 Epoch 3 / 5 60000 / 60000 [ ============================== ] - 8 s 129 us / step - loss : 0 . 3371 - acc : 0 . 8789 - mean_absolute_error : 4 . 4200 Epoch 4 / 5 60000 / 60000 [ ============================== ] - 8 s 133 us / step - loss : 0 . 3129 - acc : 0 . 8843 - mean_absolute_error : 4 . 4200 Epoch 5 / 5 60000 / 60000 [ ============================== ] - 9 s 151 us / step - loss : 0 . 2952 - acc : 0 . 8916 - mean_absolute_error : 4 . 4200 < keras . callbacks . History at 0 x1582adc6780 > The results are stored in the my_csv.csv file and contain the epoch results import pandas as pd pd . read_csv ( \"my_csv.csv\" , sep = \";\" ) epoch acc loss mean_absolute_error 0 0 0.823400 0.502013 4.42 1 1 0.863050 0.376516 4.42 2 2 0.878867 0.337097 4.42 3 3 0.884317 0.312893 4.42 4 4 0.891583 0.295157 4.42","tags":"DL","url":"redoules.github.io/dl/csv_logger_keras.html","loc":"redoules.github.io/dl/csv_logger_keras.html"},{"title":"Random integer","text":"The random module comes with the randint function that returns a pseudo random number between 2 values : # Return a random integer N such that a <= N <= b. random . randint ( a , b ) #### import random min_int = 126 max_int = 211 print ( f \"My pseudo random number between { min_int } and { max_int } : { random . randint ( min_int , max_int ) } \" ) My pseudo random number between 126 and 211 : 206","tags":"Python","url":"redoules.github.io/python/randint.html","loc":"redoules.github.io/python/randint.html"},{"title":"Using tensorboard with Keras","text":"general workflow Let's see how we can get tensorboard to work with a Keras-based Tensorflow code. import tensorflow as tf import keras import numpy as np In this example, we will be using the fashion MNIST dataset to do some basic computer vision, where we will train a Keras neural network to classify items of clothing. 
In order to import the data we will be using the built in function in Keras : keras . datasets . fashion_mnist . load_data () The model is a very simple neural network consisting in 2 fully connected layers. The model loss function is chosen in order to have a multiclass classifier : \"sparse_categorical_crossentropy\" Finally, let's train the model for 5 epochs ##get and preprocess the data fashion_mnist = keras . datasets . fashion_mnist ( train_images , train_labels ), ( test_images , test_labels ) = fashion_mnist . load_data () train_images = train_images / 255.0 test_images = test_images / 255.0 ## define the model model = keras . Sequential ([ keras . layers . Flatten ( input_shape = ( 28 , 28 )), keras . layers . Dense ( 128 , activation = \"relu\" ), keras . layers . Dense ( 10 , activation = \"softmax\" ) ]) model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" ]) model . fit ( train_images , train_labels , epochs = 5 ) Epoch 1 / 5 60000 / 60000 [ ============================== ] - 9 s 143 us / step - loss : 0 . 4939 - acc : 0 . 8254 Epoch 2 / 5 60000 / 60000 [ ============================== ] - 11 s 182 us / step - loss : 0 . 3688 - acc : 0 . 8661 Epoch 3 / 5 60000 / 60000 [ ============================== ] - 10 s 169 us / step - loss : 0 . 3305 - acc : 0 . 8798 Epoch 4 / 5 60000 / 60000 [ ============================== ] - 21 s 350 us / step - loss : 0 . 3079 - acc : 0 . 8874 Epoch 5 / 5 60000 / 60000 [ ============================== ] - 18 s 302 us / step - loss : 0 . 2889 - acc : 0 . 8927 < keras . callbacks . History at 0 x235c1bc1be0 > During the training we can see the process, including the loss and the accuracy in the output. test_loss , test_acc = model . evaluate ( test_images , test_labels ) print ( f \"Test accuracy : { test_acc } \" ) 10000/10000 [==============================] - 1s 67us/step Test accuracy : 0.8763 When the model finishes training, we get an accuracy of about 87%, and we output some sample predictions predictions = model . predict ( test_images ) print ( predictions [ 0 ]) [1.8075149e-05 3.6810281e-08 6.3094416e-07 5.1111499e-07 1.6264809e-06 3.5973577e-04 1.0840570e-06 3.1453002e-02 1.7062060e-06 9.6816361e-01] This kind of process only gives us minimal information during the training process. Setting up tensorboard To make it easier to understand, debug, and optimize TensorFlow programs, a suite of visualization tools called TensorBoard is included. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data like images that pass through it. When TensorBoard is fully configured, it looks like this: Let's start by importing the time library and tensorboard itself. It can be found in tensorflow.python.keras.callbacks. from time import time from tensorflow.python.keras.callbacks import TensorBoard After having imported our data and defined our model, we specify a log directory where the training information will get written to. #keep in mind that we already imported the data and defined the model. tensorboard = TensorBoard ( log_dir = f \"logs/ { time () } \" ) Finally, to tell Keras to call back to TensorBoard we refer to the instant of TensorBoard we created. model . compile ( optimizer = \"adam\" , loss = \"sparse_categorical_crossentropy\" , metrics = [ \"accuracy\" ]) Now, we need to execute the TensorBoard command pointing at the log directory previously specified. 
tensorboard --logdir = logs/ TensorBoard will return a http address TensorBoard 1.12.0 at http://localhost:6006 (Press CTRL+C to quit) Now, if we retrain again, we can take a look in TensorBoard and start investigating the loss and accuracy model . fit ( train_images , train_labels , epochs = 5 , callbacks = [ tensorboard ]) Epoch 1 / 5 60000 / 60000 [ ============================== ] - 41 s 684 us / step - loss : 0 . 4990 - acc : 0 . 8241 Epoch 2 / 5 60000 / 60000 [ ============================== ] - 49 s 812 us / step - loss : 0 . 3765 - acc : 0 . 8648 Epoch 3 / 5 60000 / 60000 [ ============================== ] - 46 s 765 us / step - loss : 0 . 3392 - acc : 0 . 8766 Epoch 4 / 5 60000 / 60000 [ ============================== ] - 48 s 794 us / step - loss : 0 . 3135 - acc : 0 . 8836 Epoch 5 / 5 60000 / 60000 [ ============================== ] - 49 s 813 us / step - loss : 0 . 2971 - acc : 0 . 8897 < keras . callbacks . History at 0 x235be1c76d8 > TensorBoard also give access to a dynamic visualization of the graph","tags":"DL","url":"redoules.github.io/dl/tensorboard_keras.html","loc":"redoules.github.io/dl/tensorboard_keras.html"},{"title":"Install keras using conda","text":"Keras with the Tensorflow backend can be installed by running the following conda command conda install -c conda-forge keras tensorflow If you want a Intel CPU optimized version, install tensorflow-mkl conda install -c conda-forge keras tensorflow-mkl A GPU compatible version is also available conda install -c conda-forge keras tensorflow-gpu","tags":"DL","url":"redoules.github.io/dl/keras_install.html","loc":"redoules.github.io/dl/keras_install.html"},{"title":"Day 9 - Multiple Linear Regression","text":"Problem Here is a simple equation: $$Y=a+b_1\\cdot f_1++b_2\\cdot f_2+...++b_m\\cdot f_m$$ $$Y=a+\\sum_{i=1}^m b_i\\cdot f_i$$ for \\((m+1)\\) read constants \\((a,f_1, f_2, ..., f_m)\\) . We can say that the value of \\(Y\\) depends on \\(m\\) features. We study this equation for \\(n\\) different feature sets \\((f_1, f_2, ..., f_m)\\) and records each respective value of \\(Y\\) . If we have \\(q\\) new feature sets, and without accounting for bias and variance trade-offs,what is the value of \\(Y\\) for each of the sets? Python implementation import numpy as np m = 2 n = 7 x_1 = [ 0.18 , 0.89 ] y_1 = 109.85 x_2 = [ 1.0 , 0.26 ] y_2 = 155.72 x_3 = [ 0.92 , 0.11 ] y_3 = 137.66 x_4 = [ 0.07 , 0.37 ] y_4 = 76.17 x_5 = [ 0.85 , 0.16 ] y_5 = 139.75 x_6 = [ 0.99 , 0.41 ] y_6 = 162.6 x_7 = [ 0.87 , 0.47 ] y_7 = 151.77 q_1 = [ 0.49 , 0.18 ] q_2 = [ 0.57 , 0.83 ] q_3 = [ 0.56 , 0.64 ] q_4 = [ 0.76 , 0.18 ] With scikit learn X = np . array ([ x_1 , x_2 , x_3 , x_4 , x_5 , x_6 , x_7 ]) Y = np . array ([ y_1 , y_2 , y_3 , y_4 , y_5 , y_6 , y_7 ]) X_q = np . array ([ q_1 , q_2 , q_3 , q_4 ]) from sklearn import linear_model lm = linear_model . LinearRegression () lm . fit ( X , Y ) lm . predict ( X_q ) array([105.21455835, 142.67095131, 132.93605469, 129.70175405]) without scikit learn (but with numpy) from numpy.linalg import inv #center X_R = X - np . mean ( X , axis = 0 ) a = np . mean ( Y ) Y_R = Y - a #calculate b B = inv ( X_R . T @X_R ) @X_R . T @Y_R #predict X_new_R = X_q - np . 
mean ( X , axis = 0 ) Y_new_R = X_new_R @B Y_new = Y_new_R + a Y_new array([105.21455835, 142.67095131, 132.93605469, 129.70175405])","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day9.html","loc":"redoules.github.io/blog/Statistics_10days-day9.html"},{"title":"Multiple Linear Regression","text":"If \\(Y\\) is linearly dependent only on \\(X\\) , then we can use the ordinary least square regression line, \\(\\hat{Y}=a+bX\\) . However, if \\(Y\\) shows linear dependency on \\(m\\) variables \\(X_1\\) , \\(X_2\\) , ..., \\(X_m\\) , then we need to find the values of \\(a\\) and \\(m\\) other constants ( \\(b_1\\) , \\(b_2\\) , ..., \\(b_m\\) ). We can then write the regression equation as: $$\\hat{Y}=a+\\sum_{i=1}^{m}b_iX_i$$ Matrix Form of the Regression Equation Let's consider that \\(Y\\) depends on two variables, \\(X_1\\) and \\(X_2\\) . We write the regression relation as \\(\\hat{Y}=a+b_1X_1+b_2X_2\\) . Consider the following matrix operation: $$\\begin{bmatrix} 1 & X_1 & X_2\\\\ \\end{bmatrix}\\cdot\\begin{bmatrix} a \\\\ b_1\\\\ b_2\\\\ \\end{bmatrix}=a+b_1X_1+b_2X_2$$ We define two matrices, \\(X\\) and \\(B\\) as: $$X=\\begin{bmatrix}1 & X_1 & X_2\\\\\\end{bmatrix}$$ $$B=\\begin{bmatrix}a \\\\b_1\\\\b_2\\\\\\end{bmatrix}$$ Now, we rewrite the regression relation as \\(\\hat{Y}=X\\cdot B\\) . This transforms the regression relation into matrix form. Generalized Matrix Form We will consider that \\(Y\\) shows a linear relationship with \\(m\\) variables, \\(X_1\\) , \\(X_2\\) , ..., \\(X_m\\) . Let's say that we made \\(n\\) observations on different tuples \\((x_1, x_2, ..., x_m)\\) : \\(y_1=a+b_1\\cdot x_{1,1} + b_2\\cdot x_{2,1} + ... + b_m\\cdot x_{m,1}\\) \\(y_2=a+b_1\\cdot x_{1,2} + b_2\\cdot x_{2,2} + ... + b_m\\cdot x_{m,2}\\) \\(...\\) \\(y_n=a+b_1\\cdot x_{1,n} + b_2\\cdot x_{2,n} + ... + b_m\\cdot x_{m,n}\\) Now, we can find the matrices: $$X=\\begin{bmatrix}1 & x_{1,1} & x_{2,1} & x_{3,1} & ... & x_{m,1} \\\\1 & x_{1,2} & x_{2,2} & x_{3,2} & ... & x_{m,2} \\\\1 & x_{1,3} & x_{2,3} & x_{3,3} & ... & x_{m,3} \\\\... & ... & ... & ... & ... & ... \\\\1 & x_{1,n} & x_{2,n} & x_{3,n} & ... & x_{m,n} \\\\\\end{bmatrix}$$ $$Y=\\begin{bmatrix}y_1 \\\\y_2\\\\y_3\\\\...\\\\y_n\\\\\\end{bmatrix}$$ Finding the Matrix B We know that \\(Y=X\\cdot B\\) $$\\Rightarrow X^T\\cdot Y=X^T\\cdot X \\cdot B$$ $$\\Rightarrow (X^T\\cdot X)^{-1}\\cdot X^T \\cdot Y=I\\cdot B$$ $$\\Rightarrow B= (X^T\\cdot X)^{-1}\\cdot X^T \\cdot Y$$ Finding the Value of Y Suppose we want to find the value of \\(Y\\) for some tuple \\((x_1, x_2, ..., x_m)\\) ; then $$Y=\\begin{bmatrix} 1 & x_1 & x_2 & ... & x_m\\\\ \\end{bmatrix}\\cdot B$$ Multiple Regression in Python We can use the fit function in the sklearn.linear_model.LinearRegression class. from sklearn import linear_model x = [[ 5 , 7 ], [ 6 , 6 ], [ 7 , 4 ], [ 8 , 5 ], [ 9 , 6 ]] y = [ 10 , 20 , 60 , 40 , 50 ] lm = linear_model . LinearRegression () lm . fit ( x , y ) a = lm . intercept_ b = lm .
coef_ print ( f \"Linear regression coefficients between Y and X : a= { a } , b_0= { b [ 0 ] } , b_1= { b [ 1 ] } \" ) Linear regression coefficients between Y and X : a=51.953488372092984, b_0=6.65116279069768, b_1=-11.162790697674419","tags":"Machine Learning","url":"redoules.github.io/machine-learning/Multiple_Linear_Regression.html","loc":"redoules.github.io/machine-learning/Multiple_Linear_Regression.html"},{"title":"Least Square Regression Line","text":"Linear Regression If our data shows a linear relationship between \\(X\\) and \\(Y\\) , then the straight line which best describes the relationship is the regression line. The regression line is given by \\(\\hat{Y}\\) =a+bX$. Finding the value of b The value of \\(b\\) can be calculated using either of the following formulae: \\(b=\\frac{n\\sum(x_iy_i)-(\\sum x_i)(\\sum y_i)}{n\\sum(x_i^2)-(\\sum x_i)^2}\\) \\(b=\\rho\\frac{\\sigma_Y}{\\sigma_X}\\) , where \\(\\rho\\) is the Pearson correlation coefficient, \\(\\sigma_X\\) Finding the value of a \\(a=\\bar{y}-b\\cdot\\bar{x}\\) , where \\(\\bar{x}\\) is the mean of \\(X\\) and \\(\\bar{y}\\) is the mean of \\(Y\\) . Coefficient of determination ( \\(R^2\\) ) The coefficient of determination can be computer with : \\(R^2 = \\frac{SSR}{SST}=1-\\frac{SSE}{SST}\\) Where : \\(SST\\) is the total Sum of Squares : \\(SST=\\sum (y_i-\\bar{y})^2\\) \\(SSR\\) is the regression Sum of Squares : \\(SSR=\\sum (\\hat{y_i}-\\bar{y})^2\\) \\(SSE\\) is the error Sum of Squares : \\(SSE=\\sum (\\hat{y_i}-y)^2\\) If \\(SSE\\) is small, we can assume that our fit is good. Linear Regression in Python We can use the fit function in the sklearn.linear_model.LinearRegression class. from sklearn import linear_model import numpy as np xl = [ 1 , 2 , 3 , 4 , 5 ] x = np . asarray ( xl ) . reshape ( - 1 , 1 ) y = [ 2 , 1 , 4 , 3 , 5 ] lm = linear_model . LinearRegression () lm . fit ( x , y ) print ( f 'a = { lm . intercept_ } ' ) print ( f 'b = { lm . coef_ [ 0 ] } ' ) print ( \"Where Y=a+b*X\" ) a = 0.5999999999999996 b = 0.8000000000000002 Where Y=a+b*X","tags":"Machine Learning","url":"redoules.github.io/machine-learning/LeastSquareRegressionLine.html","loc":"redoules.github.io/machine-learning/LeastSquareRegressionLine.html"},{"title":"Day 8 - Least Square Regression Line","text":"Least Square Regression Line Problem A group of five students enrolls in Statistics immediately after taking a Math aptitude test. Each student's Math aptitude test score, \\(x\\) , and Statistics course grade, \\(y\\) , can be expressed as the following list \\((x,y)\\) of points: \\((95, 85)\\) \\((85, 95)\\) \\((80, 70)\\) \\((70, 65)\\) \\((60, 70)\\) If a student scored an 80 on the Math aptitude test, what grade would we expect them to achieve in Statistics? Determine the equation of the best-fit line using the least squares method, then compute and print the value of \\(y\\) when \\(x=80\\) . 
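A minimal cross-check sketch of this fit, assuming scikit-learn is available (the walkthrough below builds the same estimate from the Pearson coefficient and the standard deviations instead):

import numpy as np
from sklearn import linear_model

# Math aptitude scores and Statistics grades from the problem statement
x = np.array([95, 85, 80, 70, 60]).reshape(-1, 1)
y = [85, 95, 70, 65, 70]

lm = linear_model.LinearRegression()
lm.fit(x, y)
# Expected Statistics grade for a Math score of 80; should agree with the
# hand-computed value of about 78.288 derived below
print(round(lm.predict([[80]])[0], 3))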
X = [ 95 , 85 , 80 , 70 , 60 ] Y = [ 85 , 95 , 70 , 65 , 70 ] n = len ( X ) def cov ( X , Y , n ): x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return 1 / n * sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) def stdv ( X , mu_x , n ): return ( sum ([( x - mu_x ) ** 2 for x in X ]) / n ) ** 0.5 def pearson_1 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) return cov ( X , Y , n ) / ( std_x * std_y ) b = pearson_1 ( X , Y , n ) * stdv ( Y , sum ( Y ) / n , n ) / stdv ( X , sum ( X ) / n , n ) a = sum ( Y ) / n - b * sum ( X ) / n print ( f \"If a student scored 80 on the math test, he would most likely score a { round ( a + 80 * b , 3 ) } in statistics\" ) If a student scored 80 on the math test, he would most likely score a 78.288 in statistics Pearson correlation coefficient Problem The regression line of \\(y\\) on \\(x\\) is \\(3x+4y+8=0\\) , and the regression line of \\(x\\) on \\(y\\) is \\(4x+3y+7=0\\) . What is the value of the Pearson correlation coefficient? Mathematical explanation The initial equation system is : $$ \\left\\{\\begin{array}{ r @{{}={}} r >{{}}c<{{}} r >{{}}c<{{}} r } 3x+4y+8=0 & (1)\\\\ 4x+3y+7=0 & (2)\\\\ \\end{array} \\right. $$ So we can rewrite the 2 lines this way : $$ \\left\\{\\begin{array}{ r @{{}={}} r >{{}}c<{{}} r >{{}}c<{{}} r } y=-2+(\\frac{-3}{4})x & (1)\\\\ x=-\\frac{7}{4}+(-\\frac{3}{4})y & (2)\\\\ \\end{array} \\right. $$ so \\(b_1=-\\frac{3}{4}\\) and \\(b_2=-\\frac{3}{4}\\) When we apply the Pearson's coefficient formula : let \\(p\\) be the pearson coefficient let \\(\\sigma_X\\) be the standard deviation of \\(x\\) let \\(\\sigma_Y\\) be the standard deviation of \\(y\\) We hence have $$ \\left\\{\\begin{array}{ r @{{}={}} r >{{}}c<{{}} r >{{}}c<{{}} r } p=b_1\\left(\\frac{\\sigma_X}{\\sigma_Y}\\right) & (1)\\\\ p=b_2\\left(\\frac{\\sigma_Y}{\\sigma_X}\\right) & (2)\\\\ \\end{array} \\right. $$ by multiplying theses 2 equations together we get $$p^2=b_1\\cdot b_2$$ $$p^2=\\left(-\\frac{3}{4}\\right)\\left(-\\frac{3}{4}\\right)$$ $$p^2=\\left(-\\frac{9}{16}\\right)$$ finally we get \\(p=\\left(-\\frac{3}{4}\\right)\\) or \\(p=\\left(\\frac{3}{4}\\right)\\) Since \\(X\\) and \\(Y\\) are negatively correlated we have \\(p=\\left(-\\frac{3}{4}\\right)\\)","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day8.html","loc":"redoules.github.io/blog/Statistics_10days-day8.html"},{"title":"Spearman's Rank Correlation Coefficient","text":"A rank correlation is any of several statistics that measure an ordinal association—the relationship between rankings of different ordinal variables or different rankings of the same variable, where a \"ranking\" is the assignment of the ordering labels \"first\", \"second\", \"third\", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. We have two random variables \\(X\\) and \\(Y\\) : * \\(X=\\{x_i, x_2, x_3, ..., x_n\\}\\) * \\(Y=\\{y_i, y_2, y_3, ..., y_n\\}\\) if \\(Rank_X\\) and \\(Rank_Y\\) denote the respective ranks of each data point, then the Spearman's rank correlation coefficient, \\(r_s\\) , is the Pearson correlation coefficient of \\(Rank_X\\) and \\(Rank_Y\\) . What does it means? The Spearman's rank correlation coefficientis is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). 
It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. A Spearman correlation of 1 results when the two variables being compared are monotonically related, even if their relationship is not linear. This means that all data-points with greater x-values than that of a given data-point will have greater y-values as well. In contrast, this does not give a perfect Pearson correlation. Example \\(X=\\{0.2, 1.3, 0.2, 1.1, 1.4, 1.5\\}\\) \\(Y=\\{1.9, 2.2, 3.1, 1.2, 2.2, 2.2\\}\\) $$ Rank_X \\quad \\begin{bmatrix} X: & 0.2 & 1.3 & 0.2 & 1.1 & 1.4 & 1.5 \\\\ Rank: & 1 & 3 & 1 & 2 & 4 & 5 \\end{bmatrix} \\quad $$ so, \\(Rank_X = \\{1, 3, 1, 2, 4, 5\\}\\) similarly, \\(Rank_Y=\\{2,3,4,1,3,3\\}\\) \\(r_s\\) equals the Pearson correlation coefficient of \\(Rank_X\\) and \\(Rank_Y\\) , meaning that \\(r=0.158114\\) Special case : \\(X\\) and \\(Y\\) don't contain duplicates $$r_s=1-\\frac{6\\sum d_i^2}{n(n^2-1)}$$ Where, \\(d_i\\) is the difference between the respective values of \\(Rank_X\\) and \\(Rank_Y\\) .","tags":"Mathematics","url":"redoules.github.io/mathematics/spearman.html","loc":"redoules.github.io/mathematics/spearman.html"},{"title":"Pearson correlation coefficient","text":"Covariance This is a measure of how two random variables change together, or the strength of their correlation. Consider two random variables, \\(X\\) and \\(Y\\) , each with \\(n\\) values (i.e., \\(x_1\\) , \\(x_2\\) , \\(...\\) , \\(x_n\\) and \\(y_1\\) , \\(y_2\\) , \\(...\\) , \\(y_n\\) ). The covariance of \\(X\\) and \\(Y\\) can be found using either of the following equivalent formulas: $$cov(X,Y)=\\frac{1}{n}\\sum_{i=1}^{n}(x_i-\\bar{x})\\cdot(y_i-\\bar{y})$$ or $$cov(X,Y)=\\frac{1}{n^2}\\sum_{i=1}^{n}\\sum_{j=1}^{n}\\frac{1}{2}(x_i-x_j)\\cdot(y_i-y_j))$$ $$cov(X,Y)=\\frac{1}{n^2}\\sum_{i}\\sum_{j\\gt i}^{n}(x_i-x_j)\\cdot(y_i-y_j)$$ where, \\(\\bar{x}\\) is the mean of \\(X\\) (or \\(\\mu_X\\) ) and \\(\\bar{y}\\) is the mean of \\(Y\\) (or \\(\\mu_Y\\) ) Pearson correlation coefficient The pearson correlation coefficient, \\(\\rho_{X,Y}\\) , is given by : $$\\rho_{X,Y}=\\frac{cov(X,Y)}{\\sigma_X\\sigma_Y}=\\frac{\\sum_{i}(x_i-\\bar{x})(y_i-\\bar{y})}{n\\sigma_X\\sigma_Y}$$ Here, \\(\\sigma_X\\) is the standard deviation of \\(X\\) and \\(\\sigma_Y\\) is the standard deviation of \\(Y\\) . You may also see \\(\\rho_{X,Y}\\) written as \\(r_{X,Y}\\) . The pearson correlation coefficient is a measure of the linear correlation between two variables X and Y.","tags":"Mathematics","url":"redoules.github.io/mathematics/pearson.html","loc":"redoules.github.io/mathematics/pearson.html"},{"title":"Day 7 - Pearson and spearman correlations","text":"Pearson correlation coefficient Problem Given two n-element data sets, \\(X\\) and \\(Y\\) , calculate the value of the Pearson correlation coefficient. 
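The from-scratch implementation appears just below; as a cross-check sketch, assuming SciPy is available (the original solution does not use it), scipy.stats.pearsonr returns the same coefficient directly:

from scipy.stats import pearsonr

# data sets used in the worked solution below
X = [10, 9.8, 8, 7.8, 7.7, 7, 6, 5, 4, 2]
Y = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]

r, _ = pearsonr(X, Y)  # returns (coefficient, two-sided p-value)
print(r)  # approximately 0.612472, matching the value computed below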
Python implementation Using the formula $$\\rho_{X,Y}=\\frac{cov(X,Y)}{\\sigma_X\\sigma_Y}$$ where $$cov(X,Y)=\\frac{1}{n}\\sum_{i=1}^{n}(x_i-\\bar{x})\\cdot(y_i-\\bar{y})$$ n = 10 X = [ 10 , 9.8 , 8 , 7.8 , 7.7 , 7 , 6 , 5 , 4 , 2 ] Y = [ 200 , 44 , 32 , 24 , 22 , 17 , 15 , 12 , 8 , 4 ] def cov ( X , Y , n ): x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return 1 / n * sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) def stdv ( X , mu_x , n ): return ( sum ([( x - mu_x ) ** 2 for x in X ]) / n ) ** 0.5 def pearson_1 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) return cov ( X , Y , n ) / ( std_x * std_y ) pearson_1 ( X , Y , n ) 0.6124721937208479 Python implementation Using the formula $$\\rho_{X,Y}=\\frac{\\sum_{i}(x_i-\\bar{x})(y_i-\\bar{y})}{n\\sigma_X\\sigma_Y}$$ def pearson_2 ( X , Y , n ): std_x = stdv ( X , 1 / n * sum ( X ), n ) std_y = stdv ( Y , 1 / n * sum ( Y ), n ) x_mean = 1 / n * sum ( X ) y_mean = 1 / n * sum ( Y ) return sum ([( X [ i ] - x_mean ) * ( Y [ i ] - y_mean ) for i in range ( n )]) / ( n * std_x * std_y ) pearson_2 ( X , Y , n ) 0.6124721937208479 Spearman's rank correlation coefficient Problem Given two \\(n\\) -element data sets, \\(X\\) and \\(Y\\) , calculate the value of Spearman's rank correlation coefficient. Python implementation We knwo that in this case, the values in each dataset are unique. Hence we can use the formula : $$r_s=1-\\frac{6\\sum d_i^2}{n(n^2-1)}$$ n = 10 X = [ 10 , 9.8 , 8 , 7.8 , 7.7 , 1.7 , 6 , 5 , 1.4 , 2 ] Y = [ 200 , 44 , 32 , 24 , 22 , 17 , 15 , 12 , 8 , 4 ] def spearman_rank ( X , Y , n ): rank_X = [ sorted ( X ) . index ( v ) + 1 for v in X ] rank_Y = [ sorted ( Y ) . index ( v ) + 1 for v in Y ] d = [( rank_X [ i ] - rank_Y [ i ]) ** 2 for i in range ( n )] return 1 - ( 6 * sum ( d )) / ( n * ( n * n - 1 )) spearman_rank ( X , Y , n ) 0.9030303030303031","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day7.html","loc":"redoules.github.io/blog/Statistics_10days-day7.html"},{"title":"Day 6 - The Central Limit Theorem","text":"Problem 1 A large elevator can transport a maximum of \\(9800\\) kg. Suppose a load of cargo containing \\(49\\) boxes must be transported via the elevator. The box weight of this type of cargo follows a distribution with a mean of \\(\\mu=205\\) kg and a standard deviation of \\(\\sigma=15\\) kg. Based on this information, what is the probability that all boxes can be safely loaded into the freight elevator and transported? Mathematical explanation This problem can be tackled with the central limit theorem. Since the number of boxes is \"large\", the sum of the weight approaches normal distribution with : * \\(\\mu' = n\\mu\\) * \\(\\sigma'=\\sigma\\sqrt{n}\\) If we want to know the probability of the sum of the mass of all boxes to be under a certain weight, we can compute the cumulative density function : $$P(x<9800) = F_X(9800)$$ max_load = 9800 n = 49 mu = 205 st_dev = 15 import math def cumulative ( x , mean , sd ): return 0.5 * ( 1 + math . erf (( x - mean ) / ( sd * math . sqrt ( 2 )))) mu_group = n * mu st_dev_group = st_dev * math . sqrt ( n ) print ( f \"Probability that all the boxes can be lifted by the elevator : { cumulative ( max_load , mu_group , st_dev_group ) } \" ) Probability that all the boxes can be lifted by the elevator : 0.009815328628645315 Problem 2 The number of tickets purchased by each student for the University X vs. 
University Y football game follows a distribution that has a mean of \\(\\mu=2.4\\) and a standard deviation of \\(\\sigma=2.0\\) . A few hours before the game starts, \\(100\\) eager students line up to purchase last-minute tickets. If there are only \\(250\\) tickets left, what is the probability that all students will be able to purchase tickets? Mathematical explanation We want to know if the sum of all the purchases will exceed the total supply of tickets. Since each student buy follows a Normal distribution and that the number of students is relatively high, the probability that all the students will be able to buy a ticket can be computed by applying the central limit theorem. The total number of tickets bought follows a normal distribution of mean \\(\\mu'=n*\\mu\\) and of standard deviation \\(\\sigma'=\\sigma\\sqrt{n}\\) ticket_supply = 250 n_students = 100 mu = 2.4 st_dev = 2 mu_group = n_students * mu st_dev_group = st_dev * math . sqrt ( n_students ) print ( f \"Probability that all the students can purchase tickets : { cumulative ( ticket_supply , mu_group , st_dev_group ) } \" ) Probability that all the students can purchase tickets : 0.691462461274013 Problem 3 You have a sample of \\(100\\) values from a population with mean \\(\\mu=500\\) and with standard deviation \\(\\sigma=80\\) . Compute the interval that covers the middle \\(95%\\) of the distribution of the sample mean; in other words, compute \\(A\\) and \\(B\\) such that \\(P(A<x<B)=0.95\\) . Use the value of \\(z=1.96\\) . Note that is the z-score. Mathematical explanation The margin of error can be computed with : $$MoE = \\frac{z-\\sigma}{\\sqrt{n}}$$ Knowing this, we can figure out what values for x paint the exact middle of the distribution for 0.95 probability, that means theres a 0.0025 leftover on both sides to make the total of 1. zScore = 1.96 std = 80 n = 100 mean = 500 marginOfError = zScore * std / math . sqrt ( n ); print ( \"A =\" , mean - marginOfError ) print ( \"B =\" , mean + marginOfError ) A = 484.32 B = 515.68","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day6.html","loc":"redoules.github.io/blog/Statistics_10days-day6.html"},{"title":"Day 5 - Poisson and Normal distributions","text":"Poisson Distribution Problem 1 A random variable, \\(X\\) , follows Poisson distribution with mean of 2.5. Find the probability with which the random variable \\(X\\) is equal to 5. Mathematical explanation In this case, the answer is straightforward, we just need to compute the value of the Poisson distribution of mean 2.5 at 5: $$P(\\lambda = 2.5, x=5)=\\frac{\\lambda^ke^{-\\lambda}}{k!}$$ $$P(\\lambda = 2.5, x=5)=\\frac{2.5^5e^{-2.5}}{5!}$$ def factorial ( k ): return 1 if k == 1 else k * factorial ( k - 1 ) from math import exp def poisson ( l , k ): return ( l ** k * exp ( - l )) / factorial ( k ) l = 2.5 k = 5 print ( f 'Probability that a random variable X following a Poisson distribution of mean { l } equals { k } : { round ( poisson ( l , k ), 3 ) } ' ) Probability that a random variable X following a Poisson distribution of mean 2.5 equals 5 : 0.067 Problem 2 The manager of a industrial plant is planning to buy a machine of either type \\(A\\) or type \\(B\\) . For each day's operation: The number of repairs, \\(X\\) , that machine \\(A\\) needs is a Poisson random variable with mean 0.88. The daily cost of operating \\(A\\) is \\(C_A=160+40X^2\\) . The number of repairs, \\(Y\\) , that machine \\(B\\) needs is a Poisson random variable with mean 1.55. 
The daily cost of operating \\(B\\) is \\(C_B=128+40Y^2\\) . Assume that the repairs take a negligible amount of time and the machines are maintained nightly to ensure that they operate like new at the start of each day. What is the expected daily cost for each machine. Mathematical explanation The cost for each machine follows a law that is the square of a Poisson distribution. $$C_Z = a + b*Z^2$$ Since the expectation is a linear operator : $$E[C_Z] = aE[1] + bE[Z^2]$$ Knowing that \\(Z\\) follows a Poisson distribution of mean \\(\\lambda\\) we have : $$E[C_Z] = a+ b(\\lambda + \\lambda^2)$$ averageX = 0.88 averageY = 1.55 CostX = 160 + 40 * ( averageX + averageX ** 2 ) CostY = 128 + 40 * ( averageY + averageY ** 2 ) print ( f 'Expected cost to run machine A : { round ( CostX , 3 ) } ' ) print ( f 'Expected cost to run machine A : { round ( CostY , 3 ) } ' ) Expected cost to run machine A : 226.176 Expected cost to run machine A : 286.1 Normal Distribution Problem 1 In a certain plant, the time taken to assemble a car is a random variable, \\(X\\) , having a normal distribution with a mean of 20 hours and a standard deviation of 2 hours. What is the probability that a car can be assembled at this plant in: Less than 19.5 hours? Between 20 and 22 hours? Mathematical explanation \\(X\\) is a real-valued random variable following a normal distribution : the probability of assembly the car in less than 19.5 hours is the cumulative distribution function of X evaluated at 19.5: $$P(X\\leq 19.5)=F_X(19.5)$$ For a normal distribution, the cumulative distribution function is : $$\\Phi(x)=\\frac{1}{2}\\left(1+erf\\left(\\frac{x-\\mu}{\\sigma\\sqrt{2}}\\right)\\right)$$ import math def cumulative ( x , mean , sd ): return 0.5 * ( 1 + math . erf (( x - mean ) / ( sd * math . sqrt ( 2 )))) mean = 20 sd = 2 print ( f 'Probability that the car is built in less than 19.5 hours : { round ( cumulative ( 19.5 , mean , sd ), 3 ) } ' ) Probability that the car is built in less than 19.5 hours : 0.401 Similarly, the probability that a car is built between 20 and 22hours can be computed thanks to the cumulative density function: $$P(20\\leq x\\leq 22) = F_X(22)-F_X(20)$$ print ( f 'Probability that the car is built between 20 and 22 hours : { round ( cumulative ( 22 , mean , sd ) - cumulative ( 20 , mean , sd ), 3 ) } ' ) Probability that the car is built between 20 and 22 hours : 0.341 Problem 2 The final grades for a Physics exam taken by a large group of students have a mean of \\(\\mu=70\\) and a standard deviation of \\(\\sigma=10\\) . If we can approximate the distribution of these grades by a normal distribution, what percentage of the students: * Scored higher than 80 (i.e., have a \\(grade \\gt 80\\) ))? * Passed the test (i.e., have a \\(grade \\gt 60\\) )? * Failed the test (i.e., have a \\(grade \\lt 60\\) )? 
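The worked explanation below relies on the erf-based cumulative function; as a quick cross-check sketch, assuming SciPy is available:

from scipy.stats import norm

mean, sd = 70, 10
print(round(norm.sf(80, mean, sd), 3))   # P(grade > 80), survival function, about 0.159
print(round(norm.sf(60, mean, sd), 3))   # P(grade > 60), about 0.841
print(round(norm.cdf(60, mean, sd), 3))  # P(grade < 60), about 0.159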
Mathematical explanation Here again, we need to appy the cumulative density function to get the probabilities : Probability that they scored higher than 80 : $$P(X\\gt80) = 1- P(X\\lt80)$$ $$P(X\\gt80) = 1- F_X(80)$$ mean = 70 sd = 10 print ( f 'Probability that the the student scored higher than 80 : { round ( 1 - cumulative ( 80 , mean , sd ), 3 ) } ' ) Probability that the the student scored higher than 80 : 0.159 Probability that they passed the test : $$P(X\\gt60) = 1- P(X\\lt60)$$ $$P(X\\gt80) = 1- F_X(60)$$ print ( f 'Probability that the the student passed the test : { round ( 1 - cumulative ( 60 , mean , sd ), 3 ) } ' ) Probability that the the student passed the test : 0.841 Probability that they failed : $$P(X\\lt60) = F_X(60)$$ print ( f 'Probability that the student failed the test: { round ( cumulative ( 60 , mean , sd ), 3 ) } ' ) Probability that the student failed the test: 0.159","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day5.html","loc":"redoules.github.io/blog/Statistics_10days-day5.html"},{"title":"Normal Distribution","text":"Normal Distribution The probability density of normal distribution is: $$\\mathcal{N}(\\mu,\\sigma^2)=\\frac{1}{\\sigma\\sqrt{2\\pi}}e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}$$ where, * \\(\\mu\\) is the mean (or expectation) of the distribution. It is also equal to median and mode of the distribution. * \\(\\sigma^2\\) is the variance. * \\(\\sigma\\) is the standard deviation. Standard Normal Distribution If \\(\\mu=0\\) and \\(\\sigma=1\\) , then the normal distribution is known as standard normal distribution: $$\\phi(x)=\\frac{e^{-\\frac{x^2}{2}}}{\\sigma\\sqrt{2\\pi}}$$ Every normal distribution can be represented as standard normal distribution: $$\\mathcal{N}(\\mu,\\sigma^2)=\\frac{1}{\\sigma}\\phi(\\frac{x-\\mu}{\\sigma})$$ Cumulative Probability Consider a real-valued random variable, \\(X\\) . The cumulative distribution function of \\(X\\) (or just the distribution function of \\(X\\) ) evaluated at \\(x\\) is the probability that \\(X\\) will take a value less than or equal to \\(x\\) : $$F_X(x)=P(X\\leq x)$$ also, $$P(a\\leq X\\leq b)=P(a\\lt X\\lt b)=F_X(b)-F_X(a)$$ the cumulative distribution function for a function with normal distribution is: $$\\Phi(x)=\\frac{1}{2}\\left(1+erf\\left(\\frac{x-\\mu}{\\sigma\\sqrt{2}}\\right)\\right)$$ where \\(erf\\) is the error function: $$erf(z)=\\frac{2}{\\sqrt{\\pi}}\\int_0^ze^{-x^2}dx$$","tags":"Mathematics","url":"redoules.github.io/mathematics/normal.html","loc":"redoules.github.io/mathematics/normal.html"},{"title":"Poisson Distribution","text":"Poisson Experiment Poisson experiment is a statistical experiment that has the following properties: * The outcome of each trial is either success or failure. * The average number of successes ( \\(\\lambda\\) ) that occurs in a specified region is known. * The probability that a success will occur is proportional to the size of the region. * The probability that a success will occur in an extremely small region is virtually zero. Poisson Distribution A Poisson random variable is the number of successes that result from a Poisson experiment. The probability distribution of a Poisson random variable is called a Poisson distribution: $$P(k,\\lambda)=\\frac{\\lambda^ke^{-\\lambda}}{k!}$$ where : * \\(\\lambda\\) is the average number of successes that occur in a specified region. * \\(k\\) is the actual number of successes that occur in a specified region. 
* \\(P(k,\\lambda)\\) is the Poisson probability, which is the probability of getting exactly \\(k\\) successes when the average number of successes is \\(\\lambda\\) . Example The average number of goals in the soccer world cup is 2.5. The probability that 4 goals are scored is then: $$p(\\lambda=2.5,k=4)=\\frac{2.5^4e^{-2.5}}{4!}=0.133$$ Expectation for the Poisson distribution Consider some Poisson random variable, \\(X\\) . Let \\(E[X]\\) be the expectation of \\(X\\) . Find the value of \\(E[X^2]\\) . Let \\(Var(X)\\) be the variance of \\(X\\) . Recall that if a random variable has a Poisson distribution, then: * \\(E[X]=\\lambda\\) * \\(Var[X]=\\lambda\\) Now, we'll use the following property of expectation and variance for any random variable, \\(X\\) : $$Var(X)=E[X^2]-(E[X])^2$$ $$E[X^2]=Var(X)+(E[X])^2$$ So, for any random variable having a Poisson distribution, the above result can be rewritten as: $$E[X^2]=\\lambda + \\lambda^2$$","tags":"Mathematics","url":"redoules.github.io/mathematics/poisson.html","loc":"redoules.github.io/mathematics/poisson.html"},{"title":"Geometric distribution","text":"Negative Binomial Experiment A negative binomial experiment is a statistical experiment that has the following properties: The experiment consists of n repeated trials. The trials are independent. The outcome of each trial is either success (s) or failure (f). \\(P(s)\\) is the same for every trial. The experiment continues until x successes are observed If \\(X\\) is the number of experiments until the \\(x^{th}\\) success occures, then \\(X\\) is a discrete random variable called a negative binomial Negative Binomial Distribution Consider the following probability mass function: $$b^*(x,n,p) = {\\binom{n-1}{x-1}}p^xq^{n-x}$$ The function above is negative binomial and has the following properties: The number of successes to be observed is \\(x\\) . The total number of trials is \\(n\\) . The probability of success of 1 trial is \\(p\\) . The probability of failure of 1 trial \\(q\\) , where \\(q=1-p\\) . \\(b^*(x,n,p)\\) is the negative binomial probability , meaning the probability of having exactly \\(x-1\\) successes out of \\(n-1\\) trials and having \\(x\\) successes after \\(n\\) trials. Geometric Distribution The geometric distribution is a special case of the negative binomial distribution that deals with the number of Bernoulli trials required to get a success (i.e., counting the number of failures before the first success). Recall that \\(X\\) is the number of successes in \\(n\\) independent Bernoulli trials, so for each \\(i\\) (where $1\\leq i\\leq n): $ X_i = \\begin{cases} 1 if the i^{th} trial is a success \\\\ 0 otherwise x \\end{cases} $ The geometric distribution is a negative binomial distribution where the number of successes is 1. We express this with the following formula: $$g(n,p)=q^{n-1}p$$ Example Bob is a high school basketball player. He is a 70% free throw shooter, meaning his probability of making a free throw is 0.7. What is the probability that Bob makes his first free throw on his fifth shot? For this experiment n=5, p=0.7 and q=0.3 So : $$g(n=5, p=0.7)=0.3^4 0.7=0.00567$$","tags":"Mathematics","url":"redoules.github.io/mathematics/Geometric.html","loc":"redoules.github.io/mathematics/Geometric.html"},{"title":"Binomial distribution","text":"Binomial Experiment A binomial experiment (or Bernoulli trial) is a statistical experiment that has the following properties: The experiment consists of n repeated trials. The trials are independent. 
The outcome of each trial is either success (s) or failure (f). Binomial Distribution We define a binomial process to be a binomial experiment meeting the following conditions: The number of successes is \\(x\\) . The total number of trials is \\(n\\) . The probability of success of 1 trial is \\(p\\) . The probability of failure of 1 trial \\(q\\) , where \\(q=1-p\\) . \\(b(x,n,p)\\) is the binomial probability , meaning the probability of having exactly \\(x\\) successes out of \\(n\\) trials. The binomial random variable is the number of successes, \\(x\\) , out of \\(n\\) trials. The binomial distribution is the probability distribution for the binomial random variable, given by the following probability mass function: $$b(x,n,p) = \\frac{n!}{x!(n-x)!}p^xq^{n-x}$$ Python code for the Binomial distribution import math def bi_dist ( x , n , p ): b = ( math . factorial ( n ) / ( math . factorial ( x ) * math . factorial ( n - x ))) * ( p ** x ) * (( 1 - p ) ** ( n - x )) return ( b ) Using numpy import numpy as np n = 10 #number of coin toss p = 0.5 #probability samples = 1000 #number of samples s = np . random . binomial ( n , p , samples ) Using the stats module from scipy from scipy.stats import binom n = 10 #number of coin toss p = 0.5 #probability samples = 1000 #number of samples s = binom . rvs ( n , p , size = samples ) Cumulative probabilities A cumulative probability refers to the probability that the value of a random variable falls within a specified range. Frequently, cumulative probabilities refer to the probability that a random variable is less than or equal to a specified value. A fair coin is tossed 10 times. Probability of getting 5 heads The probability of getting heads is: $$b(x=5, n=10, p=0.5)=0.246$$ Probability of getting at least 5 heads The probability of getting at least heads is: $$b(x\\geq 5, n=10, p=0.5)= \\sum_{r=5}^{10} b(x=r, n=10, p=0.5)$$ $$b(x\\geq 5, n=10, p=0.5)= 0.623$$ Probability of getting at most 5 heads The probability of getting at most heads is: $$b(x\\leq 5, n=10, p=0.5)= \\sum_{r=0}^{5} b(x=r, n=10, p=0.5)$$ $$b(x\\leq 5, n=10, p=0.5)= 0.623$$","tags":"Mathematics","url":"redoules.github.io/mathematics/Binomial.html","loc":"redoules.github.io/mathematics/Binomial.html"},{"title":"Day 4 - Binomial and geometric distributions","text":"Binomial distribution Problem 1 The ratio of boys to girls for babies born in Russia is \\(r=\\frac{N_b}{N_g}=1.09\\) . If there is 1 child born per birth, what proportion of Russian families with exactly 6 children will have at least 3 boys? Mathematical explanation Let's first compute the probability of having a boy : $$p_b=\\frac{N_b}{N_b+N_g}$$ where: * \\(N_b\\) is the number of boys * \\(N_g\\) is the number of girls * \\(r=\\frac{N_b}{N_g}\\) $$p_b=\\frac{1}{1+\\frac{1}{r}}$$ $$p_b=\\frac{r}{r+1}$$ r = 1.09 p_b = r / ( r + 1 ) print ( f \"The probability of having a boy is p= { p_b : 3f } \" ) The probability of having a boy is p=0.521531 The probability of getting 3 boys in 6 children is given by : $$b(x=3, n=6, p=p_b)$$ In order to compute the proportion of Russian families with exactly 6 children will have at 3 least boys we need to compute the cumulative probability distribution $$b(x\\geq 3, n=6, p=p_b) = \\sum_{i=3}^{6} b(x\\geq i, n=6, p=p_b)$$ Let's code it ! import math def bi_dist ( x , n , p ): b = ( math . factorial ( n ) / ( math . factorial ( x ) * math . 
factorial ( n - x ))) * ( p ** x ) * (( 1 - p ) ** ( n - x )) return ( b ) b , p , n = 0 , p_b , 6 for i in range ( 3 , 7 ): b += bi_dist ( i , n , p ) print ( f \"probability of getting at least 3 boys in a family with exactly 6 children : { b : .3f } \" ) probability of getting at least 3 boys in a family with exactly 6 children : 0.696 Problem 2 A manufacturer of metal pistons finds that, 12% on average, of the pistons they manufacture are rejected because they are incorrectly sized. What is the probability that a batch of 10 pistons will contain: * No more than 2 rejects? * At least 2 rejects? Mathematical explanation On average 12% of the pistons are rejected, this means that a piston has a probability of \\(p_{rejected}=0.12\\) to be rejected. The probability of getting less than 2 faulty pistons in a batch is : $$p(rejet<2) = b(x\\leq 2, n= 10, p=p_{rejected})$$ $$p(rejet<2) = \\sum_{i=0}^{2} b(x\\leq i, n=10, p=p_{rejected})$$ b , p , n = 0 , 12 / 100 , 10 for i in range ( 0 , 3 ): b += bi_dist ( i , n , p ) print ( f \"The probability of getting less than 2 faulty pistons in a batch is : { b : .3f } \" ) The probability of getting less than 2 faulty pistons in a batch is : 0.891 The probability that a batch of 10 pistons will contain at least 2 rejects : $$p(rejet<2) = b(x\\geq 2, n= 10, p=p_{rejected})$$ $$p(rejet<2) = \\sum_{i=2}^{10} b(x\\geq i, n=10, p=p_{rejected})$$ b , p , n = 0 , 12 / 100 , 10 for i in range ( 2 , 11 ): b += bi_dist ( i , n , p ) print ( f \"The probability of getting at least 2 faulty pistons in a batch is : { b : .3f } \" ) The probability of getting at least 2 faulty pistons in a batch is : 0.342 Geometric distribution Problem 1 The probability that a machine produces a defective product is \\(\\frac{1}{3}\\) . What is the probability that the first defect is found during the fith inspection? Mathematical explanation In this case, we will use a geometric distribution to evaluate the probability : * \\(n=5\\) * \\(p=\\frac{1}{3}\\) Hence, the probability that the first defect is found during the fith inspection is \\(g(n=5,p=1/3)\\) print ( f \"The probability that the first defect is found during the fith inspection is { round ((( 1 - p ) ** ( n - 1 )) * p , 3 ) } \" ) The probability that the first defect is found during the fith inspection is 0.038 Problem 2 The probability that a machine produces a defective product is \\(\\frac{1}{3}\\) . What is the probability that the first defect is found during the first 5 inspections? Mathematical explanation In this problem, we need to compute the cumulative distribution function $$p(x \\leq5) = \\sum_{i=1}^{5} g(n=i,p=1/3)$$ p_x5 = 0 p = 1 / 3 n = 5 for i in range ( 1 , n + 1 ): p_x5 += ( 1 - p ) ** ( i - 1 ) * p print ( f \"The probability that the first defect is found during the first 5 inspection is { round ( p_x5 , 3 ) } \" ) The probability that the first defect is found during the first 5 inspection is 0.868","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day4.html","loc":"redoules.github.io/blog/Statistics_10days-day4.html"},{"title":"Premutations and combinations","text":"Finding patterns in the possible ways events can occur is very useful in helping us count the number of desirable events in our sample space. Two of the easiest methods for doing this are with permutations (when order matters) and combinations (when order doesn't matter). 
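Both counting formulas are defined formally in the two sections that follow; as a companion, a minimal Python sketch of the two counts, written against those formulas with math.factorial, could look like this:

from math import factorial

def n_P_r(n, r):
    # ordered arrangements: n! / (n - r)!
    return factorial(n) // factorial(n - r)

def n_C_r(n, r):
    # unordered selections: n! / (r! * (n - r)!)
    return factorial(n) // (factorial(r) * factorial(n - r))

print(n_P_r(5, 2))  # 20 ordered arrangements of 2 objects taken from 5
print(n_C_r(5, 2))  # 10 unordered choices of 2 objects taken from 5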
Permutations An ordered arrangement r of objects from a set, A, of n objects (where \\(0\\lt r \\leq n\\) ) is called an r-element permutation of A. You can also think of this as a permutation of A's elements taken r at a time. The number of r-element permutations of an n-object set is denoted by the following formula: $$_{n}P_{r}=\\frac{n!}{(n-r)!}$$ Combinations An unordered arrangement of r objects from a set, A, of n objects (where \\(0\\lt r \\leq n\\) ) is called an r-element combination of A. You can also think of this as a combination of A's elements taken r at a time. Because the only difference between permutations and combinations is that combinations are unordered, we can easily find the number of r-element combinations by dividing out the permutations (r!): $$_{n}C_{r}=\\frac{_{n}P_{r}}{r!}=\\frac{n!}{r!(n-r)!}$$ When we talk about combinations, we're talking about the number of subsets of size r that can be made from a set of size n. In fact, \\(_{n}C_{r}\\) is often referred to as \"n choose r\", because it's counting the number of r-element combinations that can be chosen from a set of n elements.","tags":"Mathematics","url":"redoules.github.io/mathematics/Premutations_and_combinations.html","loc":"redoules.github.io/mathematics/Premutations_and_combinations.html"},{"title":"Conditional Probability","text":"This is defined as the probability of an event occurring, assuming that one or more other events have already occurred. Two events, A and B are considered to be independent if event A has no effect on the probability of event B (i.e. P(B|A)=P(A)). If events A and B are not independent, then we must consider the probability that both events occur. This can be referred to as the intersection of events A and B, defined as P(A∩B) = P(B|A)P(A). We can then use this definition to find the conditional probability by dividing the probability of the intersection of the two events (A∩B) by the probability of the event that is assumed to have already occurred (event A): $$ P(B|A)=\\frac{P(A\\cap B)}{P(A)}$$","tags":"Mathematics","url":"redoules.github.io/mathematics/cond_prob.html","loc":"redoules.github.io/mathematics/cond_prob.html"},{"title":"Bayes' Theorem","text":"Let A and B be two events such that P(A|B) denotes the probability of the occurrence of A given that B has occurred and denotes the probability of the occurrence B of given that A has occurred, then: $$ P(A|B)=\\frac{P(B|A)P(A)}{P(B)}$$ $$ P(A|B)=\\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A^c)P(A^c)}$$","tags":"Mathematics","url":"redoules.github.io/mathematics/Bayes.html","loc":"redoules.github.io/mathematics/Bayes.html"},{"title":"Day 3 - Conditionnal probability","text":"Conditionnal probability Problem Suppose a family has 2 children, one of which is a boy. What is the probability that both children are boys? Mathematical explanation Let's look at the possible outcomes : B G B BB BG G GB GG We know that at least one of the children is a boy, so only \"GG\" is not possible. The event where the family has a new boy is then \"BB\". Hence the probability is : $$\\frac{BB}{BB+GB+BG}=\\frac{1}{3}$$ Draw 2 cards from a deck Problem You 2 draw cards from a standard 52-card deck without replacing them. What is the probability that both cards are of the same suit? Mathematical explanation There are 13 cards of each suit. Draw one card. It can be anything with probability of 1. Now there are 51 cards left and 12 of them are the same suit as the first card you drew. 
So the chance the second card matches the 1st is \\(\\frac{12}{51}\\) . Drawing marbles Problem A bag contains 3 red marbles and 4 blue marbles. Then, 2 marbles are drawn from the bag, at random, without replacement. If the first marble drawn is red, what is the probability that the second marble is blue? Mathematical explanation On the first draw, the probabilities are the following : we call B the event \"a blue ball is drawn\" and R the event \"a red ball is drawn\" * \\(P(B)=\\frac{4}{7}\\) * \\(P(R)=\\frac{3}{7}\\) On the second draw, if a red ball has been drawn at first, the probabilities are : * \\(P(B|R)=\\frac{4}{6}\\) * \\(P(R|R)=\\frac{2}{6}\\) Hence, the probability of drawing a blue ball if the first ball drawn was red is \\(\\frac{1}{3}\\)","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day3.html","loc":"redoules.github.io/blog/Statistics_10days-day3.html"},{"title":"Day 2 - Probability, Compound Event Probability","text":"Basic probability with dices Problem In this challenge, we practice calculating probability. In a single toss of 2 fair (evenly-weighted) six-sided dice, find the probability that their sum will be at most 9. Mathematical explanation A nice way to think about sums-of-two-dice problems is to lay out the sums in a 6-by-6 grid in the obvious manner. 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12 We see that the identic values are on the same diagonal. The number of elements on the diagonal varies from 1 to 6 and then back to 1. let's call A < x the event : the sum all the 2 tosses is at most x. $$P(A\\leq9)=\\sum_{i=2}^{9} P(A = i)$$ $$P(A\\leq9)=1-P(A\\gt9)$$ $$P(A\\leq9)=1-\\sum_{i=10}^{12} P(A = i)$$ The value of \\(P(A = i) = \\frac{i-1}{36}\\) if \\(i \\leq 7\\) and \\(P(A = i) = \\frac{13-i}{36}\\) hence $$P(A\\leq9)=1-\\sum_{i=10}^{12} \\frac{13-i}{36}$$ $$P(A\\leq9)= 1-\\frac{6}{36}$$ $$P(A\\leq9)= \\frac{5}{6}$$ Let's program it sum ([ 1 for d1 in range ( 1 , 7 ) for d2 in range ( 1 , 7 ) if d1 + d2 <= 9 ]) / 36 0.8333333333333334 More dices Problem In a single toss of 2 fair (evenly-weighted) six-sided dice, find the probability that the values rolled by each die will be different and the two dice have a sum of 6. Mathematical explanation Let's consider 2 events : A and B. A compound event is a combination of 2 or more simple events. If A and B are simple events, then A∪B denotes the occurence of either A or B. A∩B denotes the occurence of A and B together. We denote A the event \"the values of each dice is different\". The opposit event is A' \"the values of each dice is the same\". $$P(A) = 1-P(A')$$ $$P(A)=1-\\frac{6}{36}$$ $$P(A)=\\frac{5}{6}$$ We denote B the event \"the two dice have a sum of 6\", this probability has been computed on the first part of the article : $$P(B)=\\frac{5}{36}$$ The probability of having 2 dice different of sum 6 is : $$P(A|B) = 4/5$$ The probability that both A and B occure is equal to P(A∩B). Since \\(P(A|B)=\\frac{P(A∩B)}{P(B)}\\) $$P(A∩B)=P(B)*P(A|B)$$ $$P(A∩B)=5/36*4/5$$ $$P(A∩B)=1/9$$ Let's program it sum ([ 1 for d1 in range ( 1 , 7 ) for d2 in range ( 1 , 7 ) if ( d1 + d2 == 6 ) and ( d1 != d2 )]) / 36 0.1111111111111111 Compound Event Probability Problem There are 3 urns labeled X, Y, and Z. Urn X contains 4 red balls and 3 black balls. Urn Y contains 5 red balls and 4 black balls. Urn Z contains 4 red balls and 4 black balls. One ball is drawn from each of the urns. 
What is the probability that, of the 3 balls drawn, are 2 red and is 1 black? Mathematical explanation Let's write the different probabilities: Red ball Black ball Urne X $$\\frac{4}{7}$$ $$\\frac{3}{7}$$ Urne Y $$\\frac{5}{9}$$ $$\\frac{4}{9}$$ Urne Z $$\\frac{1}{2}$$ $$\\frac{1}{2}$$ Addition rule A and B are said to be mutually exclusive or disjoint if they have no events in common (i.e., and A∩B=∅ and P(A∩B)=0. The probability of any of 2 or more events occurring is the union (∪) of events. Because disjoint probabilities have no common events, the probability of the union of disjoint events is the sum of the events' individual probabilities. A and B are said to be collectively exhaustive if their union covers all events in the sample space (i.e., A∪B=S and P(A∪B)=1). This brings us to our next fundamental rule of probability: if 2 events, A and B, are disjoint, then the probability of either event is the sum of the probabilities of the 2 events (i.e., P(A or B) = P(A)+P(B)) Mutliplication rule If the outcome of the first event (A) has no impact on the second event (B), then they are considered to be independent (e.g., tossing a fair coin). This brings us to the next fundamental rule of probability: the multiplication rule. It states that if two events, A and B, are independent, then the probability of both events is the product of the probabilities for each event (i.e., P(A and B)= P(A)xP(B)). The chance of all events occurring in a sequence of events is called the intersection (∩) of those events. The balls drawn from the urns are independant hence : p = P(2 red (R) and 1 back (B)) $$p = P(RRB) + P(RBR) + P(BRR)$$ Each of those 3 probability if equal to the product of the probability of drawing each ball \\(P(RRB) = P(R|X) * P(R|Y) * P(B|Z) = 4/7*5/9*1/2\\) \\(P(RRB) = 20/126\\) \\(P(RBR) = 16/126\\) \\(P(BRR) = 15/126\\) this leads to \\(p = 51/126\\) and finally $$p = \\frac{17}{42}$$ Let's program it X = 3 * [ \"B\" ] + 4 * [ \"R\" ] Y = 4 * [ \"B\" ] + 5 * [ \"R\" ] Z = 4 * [ \"B\" ] + 4 * [ \"R\" ] target = [ \"BRR\" , \"RRB\" , \"RBR\" ] sum ([ 1 for x in X for y in Y for z in Z if x + y + z in target ]) / sum ([ 1 for x in X for y in Y for z in Z ]) 0.40476190476190477","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day2.html","loc":"redoules.github.io/blog/Statistics_10days-day2.html"},{"title":"Day 1 - Quartiles, Interquartile Range and standard deviation","text":"Quartile Definition A quartile is a type of quantile. The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the middle value between the median and the highest value of the data set. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def quartiles ( l ): # check the input is not empty if not l : raise StatsError ( 'no data points passed' ) # 1. order the data set l = sorted ( l ) # 2. 
divide the data set in two halves mid = int ( len ( l ) / 2 ) Q2 = median ( l ) if ( len ( l ) % 2 == 0 ): # even Q1 = median ( l [: mid ]) Q3 = median ( l [ mid :]) else : # odd Q1 = median ( l [: mid ]) # same as even Q3 = median ( l [ mid + 1 :]) return ( Q1 , Q2 , Q3 ) L = [ 3 , 7 , 8 , 5 , 12 , 14 , 21 , 13 , 18 ] Q1 , Q2 , Q3 = quartiles ( L ) print ( f \"Sample : { L } \\n Q1 : { Q1 } , Q2 : { Q2 } , Q3 : { Q3 } \" ) Sample : [3, 7, 8, 5, 12, 14, 21, 13, 18] Q1 : 6.0, Q2 : 12, Q3 : 16.0 Interquartile Range Definition The interquartile range of an array is the difference between its first (Q1) and third (Q3) quartiles. Hence the interquartile range is Q3-Q1 Implementation in python without using the scientific libraries print ( f \"Interquatile range : { Q3 - Q1 } \" ) Interquatile range : 10.0 Standard deviation Definition The standard deviation (σ) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. The standard deviation can be computed with the formula: where µ is the mean : Implementation in python without using the scientific libraries import math X = [ 10 , 40 , 30 , 50 , 20 ] mean = sum ( X ) / len ( X ) X = [( x - mean ) ** 2 for x in X ] std = math . sqrt ( sum ( X ) / len ( X ) ) print ( f \"The distribution { X } has a standard deviation of { std } \" ) The distribution [400.0, 100.0, 0.0, 400.0, 100.0] has a standard deviation of 14.142135623730951","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day1.html","loc":"redoules.github.io/blog/Statistics_10days-day1.html"},{"title":"Counting values in an array","text":"Using lists If you want to count the number of occurences of an element in a list you can use the .count() function of the list object arr = [ 1 , 2 , 3 , 3 , 4 , 5 , 3 , 6 , 7 , 7 ] print ( f 'Array : { arr } \\n ' ) print ( f 'The number 3 appears { arr . count ( 3 ) } times in the list' ) print ( f 'The number 7 appears { arr . count ( 7 ) } times in the list' ) print ( f 'The number 4 appears { arr . count ( 4 ) } times in the list' ) Array : [ 1 , 2 , 3 , 3 , 4 , 5 , 3 , 6 , 7 , 7 ] The number 3 appears 3 times in the list The number 7 appears 2 times in the list The number 4 appears 1 times in the list Using collections you can get a dictonnary of the number of occurences of each elements in a list thanks to the collections object like this import collections collections . Counter ( arr ) Counter({1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}) Using numpy You can have a simular result with numpy by hacking the unique function import numpy as np arr = np . array ( arr ) unique , counts = np . unique ( arr , return_counts = True ) dict ( zip ( unique , counts )) {1: 1, 2: 1, 3: 3, 4: 1, 5: 1, 6: 1, 7: 2}","tags":"Python","url":"redoules.github.io/python/counting.html","loc":"redoules.github.io/python/counting.html"},{"title":"Building a dictonnary using comprehension","text":"An easy way to create a dictionnary in python is to use the comprehension syntaxe. It can be more expressive hence easier to read. d = { key : value for ( key , value ) in iterable } In the example bellow we use the dictionnary comprehension to build a dictonnary from a source list. 
iterable = list ( range ( 10 )) d = { str ( value ): value ** 2 for value in iterable } # create a dictionnary linking the string value of a number with the square value of this number print ( d ) {'0': 0, '1': 1, '2': 4, '3': 9, '4': 16, '5': 25, '6': 36, '7': 49, '8': 64, '9': 81} of course, you can use an other iterable an repack it with the comprehension syntaxe. In the following example, we convert a list of tuples in a dictonnary. iterable = [( \"France\" , 67.12e6 ), ( \"UK\" , 66.02e6 ), ( \"USA\" , 325.7e6 ), ( \"China\" , 1386e6 ), ( \"Germany\" , 82.79e6 )] population = { key : value for ( key , value ) in iterable } print ( population ) {'France': 67120000.0, 'UK': 66020000.0, 'USA': 325700000.0, 'China': 1386000000.0, 'Germany': 82790000.0}","tags":"Python","url":"redoules.github.io/python/dict_comprehension.html","loc":"redoules.github.io/python/dict_comprehension.html"},{"title":"Extracting unique values from a list or an array","text":"Using lists An easy way to extract the unique values of a list in python is to convert the list to a set. A set is an unordered collection of items. Every element is unique (no duplicates) and must be immutable. my_list = [ 10 , 20 , 30 , 40 , 20 , 50 , 60 , 40 ] print ( f \"Original List : { my_list } \" ) my_set = set ( my_list ) my_new_list = list ( my_set ) # the set is converted back to a list with the list() function print ( f \"List of unique numbers : { my_new_list } \" ) Original List : [10, 20, 30, 40, 20, 50, 60, 40] List of unique numbers : [40, 10, 50, 20, 60, 30] Using numpy If you are using numpy you can extract the unique values of an array with the unique function builtin numpy: import numpy as np arr = np . array ( my_list ) print ( f 'Initial numpy array : { arr } \\n ' ) unique_arr = np . unique ( arr ) print ( f 'Numpy array with unique values : { unique_arr } ' ) Initial numpy array : [ 10 20 30 40 20 50 60 40 ] Numpy array with unique values : [ 10 20 30 40 50 60 ]","tags":"Python","url":"redoules.github.io/python/unique.html","loc":"redoules.github.io/python/unique.html"},{"title":"Sorting an array","text":"Using lists Python provides an iterator to sort an array sorted() you can use it this way : import random # Random lists from [0-999] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : { arr } \\n ' ) reversed_arr = list ( sorted ( arr )) print ( f 'Sorted list : { reversed_arr } ' ) Initial random list : [ 277 , 347 , 976 , 367 , 604 , 878 , 148 , 670 , 229 , 432 ] Sorted list : [ 148 , 229 , 277 , 347 , 367 , 432 , 604 , 670 , 878 , 976 ] it is also possible to use the sort function from the list object # Random lists from [0-999] interval arr = [ random . randint ( 0 , 1000 ) for r in range ( 10 )] print ( f 'Initial random list : { arr } \\n ' ) arr . sort () print ( f 'Sorted list : { arr } ' ) Initial random list : [ 727 , 759 , 68 , 103 , 23 , 90 , 258 , 737 , 791 , 567 ] Sorted list : [ 23 , 68 , 90 , 103 , 258 , 567 , 727 , 737 , 759 , 791 ] Using numpy If you are using numpy you can sort an array by creating a view on the array: import numpy as np arr = np . random . random ( 5 ) print ( f 'Initial random array : { arr } \\n ' ) sorted_arr = np . sort ( arr ) print ( f 'Sorted array : { sorted_arr } ' ) Initial random array : [ 0 . 40021786 0 . 13876208 0 . 19939047 0 . 46015169 0 . 43734158 ] Sorted array : [ 0 . 13876208 0 . 19939047 0 . 40021786 0 . 43734158 0 . 
46015169 ]","tags":"Python","url":"redoules.github.io/python/sorting.html","loc":"redoules.github.io/python/sorting.html"},{"title":"Day 0 - Median, mean, mode and weighted mean","text":"A reminder The median The median is the value separating the higher half from the lower half of a data sample. For a data set, it may be thought of as the middle value. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it. The mean The arithmetic mean (or simply mean) of a sample is the sum of the sampled values divided by the number of items. The mode The mode of a set of data values is the value that appears most often. It is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. Implementation in python without using the scientific libraries def median ( l ): l = sorted ( l ) if len ( l ) % 2 == 0 : return ( l [ len ( l ) // 2 ] + l [( len ( l ) // 2 - 1 )]) / 2 else : return l [ len ( l ) // 2 ] def mean ( l ): return sum ( l ) / len ( l ) def mode ( data ): dico = { x : data . count ( x ) for x in list ( set ( data ))} return sorted ( sorted ( dico . items ()), key = lambda x : x [ 1 ], reverse = True )[ 0 ][ 0 ] L = [ 64630 , 11735 , 14216 , 99233 , 14470 , 4978 , 73429 , 38120 , 51135 , 67060 , 4978 , 73429 ] print ( f \"Sample : { L } \\n Mean : { mean ( L ) } , Median : { median ( L ) } , Mode : { mode ( L ) } \" ) Sample : [64630, 11735, 14216, 99233, 14470, 4978, 73429, 38120, 51135, 67060, 4978, 73429] Mean : 43117.75, Median : 44627.5, Mode : 4978 The weighted average The weighted arithmetic mean is similar to an ordinary arithmetic mean (the most common type of average), except that instead of each of the data points contributing equally to the final average, some data points contribute more than others. data = [ 10 , 40 , 30 , 50 , 20 ] weights = [ 1 , 2 , 3 , 4 , 5 ] sum_X = sum ([ x * w for x , w in zip ( data , weights )]) print ( round (( sum_X / sum ( weights )), 1 )) 32.0","tags":"Blog","url":"redoules.github.io/blog/Statistics_10days-day0.html","loc":"redoules.github.io/blog/Statistics_10days-day0.html"},{"title":"Create a simple bash function","text":"A basic function The synthaxe to define a function is : #!/bin/bash # Basic function my_function () { echo Text displayed by my_function } #once defined, you can use it like so : my_function and it should return user@bash : ./my_function.sh Text displayed by my_function Function with arguments When used, the arguments are specified directly after the function name. Whithin the function they are accessible this the $ symbol followed by the number of the arguement. Hence $1 will take the value of the first arguement, $2 will take the value of the second arguement and so on. #!/bin/bash # Passing arguments to a function say_hello () { echo Hello $1 } say_hello Guillaume and it should return user@bash : ./function_arguements.sh Hello Guillaume Overriding Commands Using the previous example, let's override the echo function in order to make it say hello. To do so, you just need to name the function with the same name as the command you want to replace. When you are calling the original function, make sure you are using the builtin keyword #!/bin/bash # Overriding a function echo () { builtin echo Hello $1 } echo Guillaume user@bash : ./function_arguements.sh Hello Guillaume Returning values Use the keyword return to send back a value to the main program. 
The returned value will be stored in the $? variable #!/bin/bash # Retruning a value secret_number () { return 126 } secret_number echo The secret number is $? This code should return user@bash : ./retrun_value.sh The secret number is 126","tags":"Linux","url":"redoules.github.io/linux/simple_bash_function.html","loc":"redoules.github.io/linux/simple_bash_function.html"},{"title":"Number of edges in a Complete graph","text":"A complete graph contains \\(\\frac{n(n-1)}{2}\\) edges where \\(n\\) is the number of vertices (or nodes).","tags":"Mathematics","url":"redoules.github.io/mathematics/Number_edges_Complete_graph.html","loc":"redoules.github.io/mathematics/Number_edges_Complete_graph.html"},{"title":"Reverse an array","text":"Using lists Python provides an iterator to reverse an array reversed() you can use it this way : arr = list ( range ( 5 )) print ( f 'Initial array : { arr } \\n ' ) reversed_arr = list ( reversed ( arr )) print ( f 'Reversed array : { reversed_arr } ' ) Initial array : [ 0 , 1 , 2 , 3 , 4 ] Reversed array : [ 4 , 3 , 2 , 1 , 0 ] Using numpy If you are using numpy you can reverse an array by creating a view on the array: import numpy as np arr = np . arange ( 5 ) print ( f 'Initial array : { arr } \\n ' ) reversed_arr = arr [:: - 1 ] print ( f 'Reversed array : { reversed_arr } ' ) Initial array : [ 0 1 2 3 4 ] Reversed array : [ 4 3 2 1 0 ]","tags":"Python","url":"redoules.github.io/python/reverse.html","loc":"redoules.github.io/python/reverse.html"},{"title":"Advice for designing your own libraries","text":"Advice for designing your own libraries When designing your own library make sure to think of the following things. I will add new paragraphs to this article as I dicover new good practices. Use standard python objects Try to use standard python objects as much as possible. That way, your library becomes compatible with all the other python libaries. For instance, when I created SAMpy : a library for reading and writing SAMCEF results, it returned dictonnaries, lists and pandas dataframes. Hence the results extracted from SAMCEF where compatible with all the scientific stack of python. Limit the number of functionnalities Following the same logic as before, the objects should do only one thing but do it well. Indeed, having a simple interface will reduce the complexity of your code and make it easier to use your library. Again, with SAMpy, I decided to strictly limit the functionnalities to reading and writing SAMCEF files. Define an exception class for your library You should define your own exceptions in order to make it easier for your users to debug their code thanks to clearer messages that convey more meaning. That way, the user will know if the error comes from your library or something else. Bonus if you group similar exceptions in a hierachy of inerited Exception classes. Example : let's create a Exception related to the age of a person : def check_age ( age ): if age < 0 and age > 130 : raise ValueError If the user inputed an invalid age, the ValueError exception would be thrown. That's fine but imagine you wan't to provide more feedback to your users that don't know the internal of your library. 
Let's now create a self-explanatory Exception class AgeInvalidError ( ValueError ): pass def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( age ) You can also add some helpful text to guide your users along the way, for instance as a docstring: class AgeInvalidError ( ValueError ): \"\"\" Age invalid, must be between 0 and 130 \"\"\" def check_age ( age ): if age < 0 or age > 130 : raise AgeInvalidError ( age ) If you want to group all the logically linked exceptions, you can create a base class and inherit from it : class BaseAgeInvalidError ( ValueError ): pass class TooYoungError ( BaseAgeInvalidError ): pass class TooOldError ( BaseAgeInvalidError ): pass def check_age ( age ): if age < 0 : raise TooYoungError ( age ) elif age > 130 : raise TooOldError ( age ) Structure your repository You should have a file structure in your repository. It will help other contributors, especially future contributors. A nice directory structure for your project should look like this: README.md LICENSE setup.py requirements.txt ./MyPackage ./docs ./tests Some prefer to use reStructuredText, I personally prefer Markdown choosealicense.com will help you pick the license to use for your project. For package and distribution management, create a setup.py file at the root of the directory The list of dependencies required to test, build and generate the doc are listed in a pip requirement file placed at the root of the directory and named requirements.txt Put the documentation of your library in the docs directory. Put your tests in the tests directory. Since your tests will need to import your library, I recommend modifying the path to resolve your package properly. In order to do so, you can create a context.py file located in the tests directory : import os import sys sys . path . insert ( 0 , os . path . abspath ( os . path . join ( os . path . dirname ( __file__ ), '..' ))) import MyPackage Then within your individual test files you can import your package like so : from .context import MyPackage Finally, your code will go into the MyPackage directory Test your code Once your library is in production, you have to guarantee some level of forward compatibility. Once your interface is defined, write some tests. In the future, when your code is modified, having those tests will make sure that the behaviour of your functions and objects won't be altered. Document your code Of course, you should have documentation to go along with your library. Make sure to add a lot of common examples as most users tend to learn from examples. I recommend writing your documentation using Sphinx.","tags":"Python","url":"redoules.github.io/python/design_own_libs.html","loc":"redoules.github.io/python/design_own_libs.html"},{"title":"Safely creating a folder if it doesn't exist","text":"Safely creating a folder if it doesn't exist When you are writing to files in python, if the file doesn't exist it will be created. However, if you are trying to write a file in a directory that doesn't exist, an exception will be raised FileNotFoundError : [ Errno 2 ] No such file or directory : \"directory\" This article will teach you how to make sure the target directory exists. If it doesn't, the function will create that directory. First, let's import os and make sure that the \"test_directory\" doesn't exist import os os . path . exists ( \". \\\\ test_directory\" ) False Copy the ensure_dir function into your code. This function will handle the creation of the directory.
Credit goes to Parand posted on StackOverflow def ensure_dir ( file_path ): directory = os . path . dirname ( file_path ) if not os . path . exists ( directory ): os . makedirs ( directory ) Let's now use the function and create a folder named \"test_directory\" ensure_dir ( \". \\\\ test_directory\" ) If we test for the existence of the directory, the exists function will now return True os . path . exists ( \". \\\\ test_directory\" ) True","tags":"Python","url":"redoules.github.io/python/ensure_dir.html","loc":"redoules.github.io/python/ensure_dir.html"},{"title":"List all files in a directory","text":"Listing all the files in a directory Let's start with the basics, the most staigthforward way to list all the files in a direcoty is to use a combinaison of the listdir function and isfile form os.path. You can use a list comprehension to store all the results in a list. mypath = \"./test_directory/\" from os import listdir from os.path import isfile , join [ f for f in listdir ( mypath ) if isfile ( join ( mypath , f ))] ['logfile.log', 'myfile.txt', 'super_music.mp3', 'textfile.txt'] Listing all the files of a certain type in a directory similarly, if you want to filter only a certain kind of file based on its extension you can use the endswith method. In the following example, we will filter all the \"txt\" files contained in the directory [ f for f in listdir ( mypath ) if f . endswith ( '.' + \"txt\" )] ['myfile.txt', 'textfile.txt'] Listing all the files matching a pattern in a directory The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. You can use the *, ?, and character ranges expressed with [] wildcards import glob glob . glob ( \"*.txt\" ) ['myfile.txt'] Listing files recusively If you want to list all files recursively you can select all the sub-directories using the \"**\" wildcard import glob glob . glob ( mypath + '/**/*.txt' , recursive = True ) ['./test_directory\\\\myfile.txt', './test_directory\\\\textfile.txt', './test_directory\\\\subdir1\\\\file_hidden_in_a_sub_direcotry.txt'] Using a regular expression If you'd rather use a regular expression to select the files, the pathlib library provides the rglob function. from pathlib import Path list ( Path ( \"./test_directory/\" ) . rglob ( \"*.[tT][xX][tT]\" )) [WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')] Using regular expressions you can for example select multiple types of files. In the following example, we list all the files that finish either with \"txt\" or with \"log\". list ( Path ( \"./test_directory/\" ) . rglob ( \"*.[tl][xo][tg]\" )) [WindowsPath('test_directory/logfile.log'), WindowsPath('test_directory/myfile.txt'), WindowsPath('test_directory/textfile.txt'), WindowsPath('test_directory/subdir1/file_hidden_in_a_sub_direcotry.txt')]","tags":"Python","url":"redoules.github.io/python/list_files_directory.html","loc":"redoules.github.io/python/list_files_directory.html"},{"title":"Using Dask on infiniband","text":"InfiniBand (abbreviated IB) is a computer-networking communications standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. (source Wikipedia). 
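Before pointing the scheduler and workers at InfiniBand, it helps to confirm what the interface is called on each node. As a minimal sketch (assuming a Unix host and Python 3.3+; the ib0 name is only an example), the standard library can list the interface names:

import socket

# list the network interfaces visible to the OS (Unix only);
# an InfiniBand interface typically shows up with a name such as ib0
for index, name in socket.if_nameindex():
    print(index, name)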
If you want to leverage this high speed network instead of the regular ethernet network, you have to specify to the scheduler that you want to used infiniband as your interface. Assuming that you Infiniband interface is ib0 , you would call the scheduler like this : dask-scheduler --interface ib0 --scheduler-file ./cluster.yaml you would have to call the worker using the same interface : dask-worker --interface ib0 --scheduler-file ./cluster.yaml","tags":"Python","url":"redoules.github.io/python/dask_infiniband.html","loc":"redoules.github.io/python/dask_infiniband.html"},{"title":"Clearing the current cell in the notebook","text":"In python, you can clear the output of a cell by importing the IPython.display module and using the clear_output function from IPython.display import clear_output print ( \"text to be cleared\" ) clear_output () As you can see, the text \"text to be cleared\" is not displayed because the function clear_output has been called afterward","tags":"Jupyter","url":"redoules.github.io/jupyter/clear_cell.html","loc":"redoules.github.io/jupyter/clear_cell.html"},{"title":"What's inside my .bashrc ?","text":"# ~/.bashrc: executed by bash(1) for non-login shells. # see /usr/share/doc/bash/examples/startup-files (in the package bash-doc) # for examples # If not running interactively, don't do anything case $- in *i* ) ;; * ) return ;; esac # don't put duplicate lines or lines starting with space in the history. # See bash(1) for more options HISTCONTROL = ignoreboth # append to the history file, don't overwrite it shopt -s histappend # for setting history length see HISTSIZE and HISTFILESIZE in bash(1) HISTSIZE = 1000 HISTFILESIZE = 2000 # check the window size after each command and, if necessary, # update the values of LINES and COLUMNS. shopt -s checkwinsize # If set, the pattern \"**\" used in a pathname expansion context will # match all files and zero or more directories and subdirectories. #shopt -s globstar # make less more friendly for non-text input files, see lesspipe(1) [ -x /usr/bin/lesspipe ] && eval \" $( SHELL = /bin/sh lesspipe ) \" # set variable identifying the chroot you work in (used in the prompt below) if [ -z \" ${ debian_chroot :- } \" ] && [ -r /etc/debian_chroot ] ; then debian_chroot = $( cat /etc/debian_chroot ) fi # set a fancy prompt (non-color, unless we know we \"want\" color) case \" $TERM \" in xterm-color | *-256color ) color_prompt = yes ;; esac # uncomment for a colored prompt, if the terminal has the capability; turned # off by default to not distract the user: the focus in a terminal window # should be on the output of commands, not on the prompt force_color_prompt = yes if [ -n \" $force_color_prompt \" ] ; then if [ -x /usr/bin/tput ] && tput setaf 1 > & /dev/null ; then # We have color support; assume it's compliant with Ecma-48 # (ISO/IEC-6429). (Lack of such support is extremely rare, and such # a case would tend to support setf rather than setaf.) 
color_prompt = yes else color_prompt = fi fi if [ \" $color_prompt \" = yes ] ; then PS1 = '${debian_chroot:+($debian_chroot)}\\[\\033[01;32m\\]\\u@\\h\\[\\033[00m\\]:\\[\\033[01;34m\\]\\w\\[\\033[00m\\]\\$ ' else PS1 = '${debian_chroot:+($debian_chroot)}\\u@\\h:\\w\\$ ' fi unset color_prompt force_color_prompt # If this is an xterm set the title to user@host:dir case \" $TERM \" in xterm* | rxvt* ) PS1 = \"\\[\\e]0; ${ debian_chroot :+( $debian_chroot ) } \\u@\\h: \\w\\a\\] $PS1 \" ;; * ) ;; esac # enable color support of ls and also add handy aliases if [ -x /usr/bin/dircolors ] ; then test -r ~/.dircolors && eval \" $( dircolors -b ~/.dircolors ) \" || eval \" $( dircolors -b ) \" alias ls = 'ls --color=auto' #alias dir='dir --color=auto' #alias vdir='vdir --color=auto' alias grep = 'grep --color=auto' alias fgrep = 'fgrep --color=auto' alias egrep = 'egrep --color=auto' fi # colored GCC warnings and errors #export GCC_COLORS='error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01' # some more ls aliases alias ll = 'ls -alF' alias la = 'ls -A' alias l = 'ls -CF' # Add an \"alert\" alias for long running commands. Use like so: # sleep 10; alert alias alert = 'notify-send --urgency=low -i \"$([ $? = 0 ] && echo terminal || echo error)\" \"$(history|tail -n1|sed -e ' \\' 's/^\\s*[0-9]\\+\\s*//;s/[;&|]\\s*alert$//' \\' ')\"' # Alias definitions. # You may want to put all your additions into a separate file like # ~/.bash_aliases, instead of adding them here directly. # See /usr/share/doc/bash-doc/examples in the bash-doc package. if [ -f ~/.bash_aliases ] ; then . ~/.bash_aliases fi # enable programmable completion features (you don't need to enable # this, if it's already enabled in /etc/bash.bashrc and /etc/profile # sources /etc/bash.bashrc). if ! shopt -oq posix ; then if [ -f /usr/share/bash-completion/bash_completion ] ; then . /usr/share/bash-completion/bash_completion elif [ -f /etc/bash_completion ] ; then . /etc/bash_completion fi fi # export PATH=\"/home/guillaume/anaconda3/bin:$PATH\" # commented out by conda initialize #source activate base # >>> conda initialize >>> # !! Contents within this block are managed by 'conda init' !! __conda_setup = \" $( '/home/guillaume/anaconda3/bin/conda' 'shell.bash' 'hook' 2 > /dev/null ) \" if [ $? -eq 0 ] ; then eval \" $__conda_setup \" else if [ -f \"/home/guillaume/anaconda3/etc/profile.d/conda.sh\" ] ; then . \"/home/guillaume/anaconda3/etc/profile.d/conda.sh\" else export PATH = \"/home/guillaume/anaconda3/bin: $PATH \" fi fi unset __conda_setup # <<< conda initialize <<< source activate base","tags":"Linux","url":"redoules.github.io/linux/bashrc.html","loc":"redoules.github.io/linux/bashrc.html"},{"title":"Efficient extraction of eigenvalues from a list of tensors","text":"When you manipulate FEM results you generally have either a: * scalar field, * vector field, * tensor field. With tensorial results, it is often useful to extract the eigenvalues in order to find the principal values. 
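For a single symmetric tensor, a minimal sketch is to build the full 3x3 matrix and call numpy directly (eigvalsh assumes a symmetric matrix, which holds for stress and strain tensors):

import numpy as np

# a symmetric stress tensor written out as a full 3x3 matrix
# (components XX=4, YY=-1, ZZ=0, XY=2, XZ=2, YZ=1)
sigma = np.array([[4, 2, 2],
                  [2, -1, 1],
                  [2, 1, 0]])

# principal values (eigenvalues), returned in ascending order
print(np.linalg.eigvalsh(sigma))  # approximately [-1.779, -0.805, 5.584]

The rest of this note scales the same idea to many tensors at once, without looping over the nodes.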
I have found that it is easier to store the components of the tensors in a 6 column pandas dataframe (because of the symmetric property of stress and strain tensors) import pandas as pd node = [ 1001 , 1002 , 1003 , 1004 ] #when dealing with FEM results you should remember at which element/node the result is computed (in the example, let's assume that we look at node from 1001 to 1004) tensor1 = [ 1 , 1 , 1 , 0 , 0 , 0 ] #eigen : 1 tensor2 = [ 4 , - 1 , 0 , 2 , 2 , 1 ] #eigen : 5.58443, -1.77931, -0.805118 tensor3 = [ 1 , 6 , 5 , 3 , 3 , 1 ] #eigen : 8.85036, 4.46542, -1.31577 tensor4 = [ 1 , 2 , 3 , 0 , 0 , 0 ] #eigen : 1, 2, 3 df = pd . DataFrame ([ tensor1 , tensor2 , tensor3 , tensor4 ], columns = [ \"XX\" , \"YY\" , \"ZZ\" , \"XY\" , \"XZ\" , \"YZ\" ]) df . index = node df XX YY ZZ XY XZ YZ 1001 1 1 1 0 0 0 1002 4 -1 0 2 2 1 1003 1 6 5 3 3 1 1004 1 2 3 0 0 0 If you want to extract the eigenvalues of a tensor with numpy you have to pass a n by n ndarray to the eigenvalue function. In order to avoid having to loop over each node, this oneliner is highly optimized and will help you invert a large number of tensors efficiently. The steps are basically, create a list of n by n values (here n=3) in the right order => reshape it to a list of tensors => pass it to the eigenvals function import numpy as np from numpy import linalg as LA eigenvals = LA . eigvals ( df [[ \"XX\" , \"XY\" , \"XZ\" , \"XY\" , \"YY\" , \"YZ\" , \"XZ\" , \"YZ\" , \"ZZ\" ]] . values . reshape ( len ( df ), 3 , 3 )) eigenvals array([[ 1. , 1. , 1. ], [ 5.58442834, -0.80511809, -1.77931025], [-1.31577211, 8.85035616, 4.46541595], [ 1. , 2. , 3. ]])","tags":"Python","url":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html","loc":"redoules.github.io/python/Efficient_extraction_of_eigenvalues_from_a_list_of_tensors.html"},{"title":"Optimized numpy random number generation on Intel CPU","text":"Python Intel distribution Make sure you have a python intel distribution. When you startup python you should see somethine like : Python 3.6.2 |Intel Corporation| (default, Aug 15 2017, 11:34:02) [MSC v.1900 64 bit (AMD64)] Type 'copyright', 'credits' or 'license' for more information IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help. If not, you can force the installation of the intel optimized python with : conda update --all conda config --add channels intel conda install numpy --channel intel --override-channels oh and by the way, make sure you a running an Intel CPU ;) Comparing numpy.random with numpy.random_intel Let's now test both the rand function with and without the Intel optimization import numpy as np from numpy import random , random_intel % timeit np . random . rand ( 10 ** 5 ) 1.06 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) % timeit np . random_intel . rand ( 10 ** 5 ) 225 µs ± 3.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)","tags":"Python","url":"redoules.github.io/python/Optimized_numpy_random_intel.html","loc":"redoules.github.io/python/Optimized_numpy_random_intel.html"},{"title":"How to check Linux process information?","text":"How to check Linux process information (CPU usage, memory, user information, etc.)? You need to use the ps command combined with the grep command. In the example, we want to check the information on the nginx process : ps aux | grep nginx It would return the output : root 9976 0.0 0.0 12272 108 ? 
S<s Aug12 0:00 nginx: master process /usr/bin/nginx -g pid /run/nginx.pid; daemon on; master_process on; http 16780 0.0 0.0 12384 684 ? S< Aug12 4:11 nginx: worker process http 16781 0.0 0.0 12556 708 ? S< Aug12 0:24 nginx: worker process http 16782 0.0 0.1 12292 744 ? S< Aug12 2:43 nginx: worker process http 16783 0.0 0.1 12276 872 ? S< Aug12 0:24 nginx: worker process admin 17612 0.0 0.1 5120 864 pts/4 S+ 11:22 0:00 grep --color=auto nginx The columns have the following order : USER;PID;%CPU;%MEM;VSZ;RSS;TTY;STAT;START;TIME;COMMAND USER = user owning the process PID = process ID of the process %CPU = It is the CPU time used divided by the time the process has been running. %MEM = ratio of the process's resident set size to the physical memory on the machine VSZ = virtual memory usage of entire process (in KiB) RSS = resident set size, the non-swapped physical memory that a task has used (in KiB) TTY = controlling tty (terminal) STAT = multi-character process state START = starting time or date of the process TIME = cumulative CPU time COMMAND = command with all its arguments Interactive display If you want an interactive display showing in real time the statistics of the running process you can use the top command. If htop is available on you system, use this instead.","tags":"Linux","url":"redoules.github.io/linux/linux_process_information.html","loc":"redoules.github.io/linux/linux_process_information.html"},{"title":"Check the size of a directory","text":"How do you Check the size of a directory in linux? The du command will come handy for this task. Let's say we want to know the size of the directory named recommandations , we would run the following command du -sh recommendations It would return the output : 9.9M recommendations","tags":"Linux","url":"redoules.github.io/linux/directory_size.html","loc":"redoules.github.io/linux/directory_size.html"},{"title":"Check for free disk space","text":"How do you check for free disk space on linux? The df command will come handy for this task. 
Run the command with the following arguments df -ah Will return in a human readable format for all drives a readout of all your filesystems Filesystem Size Used Avail Use% Mounted on /dev/md0 2.4G 1.3G 1.1G 54% / none 348M 4.0K 348M 1% /dev none 0 0 0 - /dev/pts none 0 0 0 - /proc none 0 0 0 - /sys /tmp 350M 1.3M 349M 1% /tmp /run 350M 3.2M 347M 1% /run /dev/shm 350M 12K 350M 1% /dev/shm /proc/bus/usb 0 0 0 - /proc/bus/usb securityfs 0 0 0 - /sys/kernel/security /dev/md3 1.8T 372G 1.5T 21% /volume2 /dev/vg1000/lv 1.8T 1.5T 340G 82% /volume1 /dev/sdq1 7.4G 3.8G 3.5G 52% /volumeUSB3/usbshare3-1 /dev/sdr 294G 146G 134G 53% /volumeUSB2/usbshare none 0 0 0 - /proc/fs/nfsd none 0 0 0 - /config The free disk space can be read in the Avail column","tags":"Linux","url":"redoules.github.io/linux/free_disk_space_linux.html","loc":"redoules.github.io/linux/free_disk_space_linux.html"},{"title":"Check your current ip address","text":"Check your current ip address Run the command ip addr show and that will give you every information available 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft forever 3: eth1: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 0a:00:27:00:00:03 inet 192.168.56.1/24 brd 192.168.56.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::84cd:374e:843f:f82f/64 scope global dynamic valid_lft forever preferred_lft forever 15: eth2: <> mtu 1500 group default qlen 1 link/ether 00:ff:d2:8a:19:c3 inet 169.254.40.62/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::f199:2ca0:7aff:283e/64 scope global dynamic valid_lft forever preferred_lft forever 1: lo: <LOOPBACK,UP> mtu 1500 group default qlen 1 link/loopback 00:00:00:00:00:00 inet 127.0.0.1/8 brd 127.255.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 ::1/128 scope global dynamic valid_lft forever preferred_lft forever 5: wifi0: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ad inet 192.168.1.1/24 brd 192.168.1.255 scope global dynamic valid_lft 42720sec preferred_lft 42720sec inet6 fe80::395f:3594:1dc2:57e3/64 scope global dynamic valid_lft forever preferred_lft forever 21: wifi1: <> mtu 1500 group default qlen 1 link/ieee802.11 5c:51:4f:41:7a:ae inet 169.254.12.77/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::58d5:630:cbbd:c4d/64 scope global dynamic valid_lft forever preferred_lft forever 12: eth3: <> mtu 1472 group default qlen 1 link/ether 00:00:00:00:00:00:00:e0:00:00:00:00:00:00:00:00 inet6 fe80::100:7f:fffe/64 scope global dynamic valid_lft forever preferred_lft forever 10: eth4: <BROADCAST,MULTICAST,UP> mtu 1500 group default qlen 1 link/ether 22:b7:57:52:5f:ff inet 192.168.42.106/24 brd 192.168.42.255 scope global dynamic valid_lft 6659sec preferred_lft 6659sec inet6 fe80::5110:eb6f:deb0:45c4/64 scope global dynamic valid_lft forever preferred_lft forever You can select only one interface ip addr show eth0 and only the relevant information will be displayed 4: eth0: <> mtu 1500 group default qlen 1 link/ether 5c:51:4f:41:7a:b1 inet 169.254.33.33/16 brd 169.254.255.255 scope global dynamic valid_lft forever preferred_lft forever inet6 fe80::390a:f69e:1ba2:2121/64 scope global dynamic valid_lft forever preferred_lft 
forever Check your current ip address the old way The ifconfig command will return information regarding your network interfaces. Let's try it: ifconfig eth1 Link encap:Ethernet HWaddr 0a:00:27:00:00:03 inet adr:192.168.56.1 Bcast:192.168.56.255 Masque:255.255.255.0 adr inet6: fe80::84cd:374e:843f:f82f/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) eth4 Link encap:Ethernet HWaddr 22:b7:57:52:5f:ff inet adr:192.168.42.106 Bcast:192.168.42.255 Masque:255.255.255.0 adr inet6: fe80::5110:eb6f:deb0:45c4/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) lo Link encap:Boucle locale inet adr:127.0.0.1 Masque:255.0.0.0 adr inet6: ::1/128 Scope:Global UP LOOPBACK RUNNING MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) wifi0 Link encap:UNSPEC HWaddr 5C-51-4F-41-7A-AD-00-00-00-00-00-00-00-00-00-00 inet adr:192.168.1.1 Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: fe80::395f:3594:1dc2:57e3/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 Packets reçus:0 erreurs:0 :0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 Octets reçus:0 (0.0 B) Octets transmis:0 (0.0 B) On the left column you have the list of network adapters. The lo is the local loopback, it is an interface that points to the localhost. Interfaces starting with eth refer to wired connections over ethernet (or sometimes USB in the case of a phone acting as an access point over USB). Interfaces starting with wlan or wifi refer to wireless connections. On the right column you have some information corresponding to the interface such as the IPv4, the IPv6, the mask, some statistics about the interface and so on.","tags":"Linux","url":"redoules.github.io/linux/get_ip_linux.html","loc":"redoules.github.io/linux/get_ip_linux.html"},{"title":"Check the version of the kernel currently running","text":"Check the version of the kernel currently running The uname command will give you the version of the kernel. In order to get a more useful output, type uname -a This will return : * the hostname * os name * kernel release/version * architecture * etc. variations If you only want the kernel version you can type uname -v if you only want the kernel release you can type uname -r","tags":"Linux","url":"redoules.github.io/linux/version_kernel.html","loc":"redoules.github.io/linux/version_kernel.html"},{"title":"Running the notebook on a remote server","text":"Jupyter hub With JupyterHub you can create a multi-user Hub which spawns, manages, and proxies multiple instances of the single-user Jupyter notebook server. Project Jupyter created JupyterHub to support many users. The Hub can offer notebook servers to a class of students, a corporate data science workgroup, a scientific research project, or a high performance computing group. However, if you are the only one using the server and you just want a simple way to run the notebook on your server and access it through the web interface on a light client without having to install and configure the jupyter hub, you can do the following. 
Problem with jupyter notebook On your server, run the command jupyter-notebook you should get something like : [I 11:18:44.514 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:18:44.515 NotebookApp] 0 active kernels [I 11:18:44.516 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 [I 11:18:44.516 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:18:44.519 NotebookApp] No web browser found: could not locate runnable browser. [C 11:18:44.520 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023 and if you try to connect to your server ip (in my example : http://192.168.1.2:8888/?token=357587e3269b0f20f2b7e1918492890ae7573ac7ef1d2023) you will get an \"ERR_CONNECTION_REFUSED\" error. This is because, by default, Jupyter Notebook only accepts connections from localhost. Allowing connexions from other sources From any IP The simplest way to avoid the connection error is to allow the notebook to accept connections from any ip jupyter-notebook --ip = * you will get something like [W 11:26:45.285 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended. [I 11:26:45.342 NotebookApp] Serving notebooks from local directory: /volume2/homes/admin [I 11:26:45.342 NotebookApp] 0 active kernels [I 11:26:45.343 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 [I 11:26:45.343 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). [W 11:26:45.346 NotebookApp] No web browser found: could not locate runnable browser. [C 11:26:45.347 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=52af33d628881824968b4031967e8541a27cc28b1720c199 and if you connect form a remote client (192.168.1.1 in my example), the following line will be added to the output : [I 11:26:54.798 NotebookApp] 302 GET /?token=52af33d628881824968b4031967e8541a27cc28b1720c199 (192.168.1.1) 111.17ms note that you should only do that if you are the only one using the server because the connection is not encypted. From a specific IP You can also, explicitly specify the ip of the client jupyter-notebook --ip = 192 .168.1.1 [I 11:44:58.104 NotebookApp] JupyterLab extension loaded from C:\\Users\\Guillaume\\Miniconda3\\lib\\site-packages\\jupyterlab [I 11:44:58.104 NotebookApp] JupyterLab application directory is C:\\Users\\Guillaume\\Miniconda3\\share\\jupyter\\lab [I 11:44:58.244 NotebookApp] Serving notebooks from local directory: C:\\Users\\Guillaume [I 11:44:58.245 NotebookApp] 0 active kernels [I 11:44:58.245 NotebookApp] The Jupyter Notebook is running at: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:58.245 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 
[C 11:44:58.258 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://192.168.1.1:8888/?token=503576dd8fa87d1f2c416df307e9b900e520b4942e317b32 [I 11:44:59.083 NotebookApp] Accepting one-time-token-authenticated connection from 192.168.1.1","tags":"Jupyter","url":"redoules.github.io/jupyter/remote_run_notebook.html","loc":"redoules.github.io/jupyter/remote_run_notebook.html"},{"title":"Running multiple calls to a function in parallel with Dask","text":"Dask.distributed is a lightweight library for distributed computing in Python. It allows to create a compute graph. Dask distributed is architectured around 3 parts : the dask-scheduler the dask-worker(s) the dask client Dask architecture The Dask scheduler is a centrally managed, distributed, dynamic task scheduler. It recieves tasks from a/multiple client(s) and spread them across one or multiple dask-worker(s). Dask-scheduler is an event based asynchronous dynamic scheduler, meaning that mutliple clients can submit a list of task to be executed on multiple workers. Internally, the task are represented as a directed acyclic graph. Both new clients and new workers can be connected or disconnected during the execution of the task graph. Tasks can be submited with the function client . submit ( function , * args , ** kwargs ) or by using objects from the dask library such as dask.dataframe, dask.bag or dask.array Setup In this example, we will use a distributed scheduler on a single machine with multiple workers and a single client. We will use the client to submit some tasks to the scheduler. The scheduler will then dispatch those tasks to the workers. The process can be monitored in real time through a web application. For this example, all the computations will be run on a local computer. However dask can scale to a large HPC cluster. First we have to launch the dask-scheduler; from the command line, input dask-scheduler Next, you can load the web dashboard. In order to do so, the scheduler returns the number of the port you have to connect to in the line starting with \"bokeh at :\". The default port is 8787. Since we are running all the programs on the same computer, we just have to login to http://127.0.0.1:8787/status Finally, we have to launch the dask-worker(s). If you want to run the worker(s) on the same computer as the scheduler the type : dask-worker 127 .0.0.1:8786 otherwise, make sure you are inputing the ip address of the computer hosting the dask-scheduler. You can launch as many workers as you want. In this example, we will run 3 workers on the local machine. Use the dask workers within your python code We will now see how to submit multiple calls to a fucntion in parallel on the dask-workers. Import the required libraries and define the function to be executed. import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] In order to connect to the scheduler, we create a client. 
client = Client ( '127.0.0.1:8786' ) client Client Scheduler: tcp://127.0.0.1:8786 Dashboard: http://127.0.0.1:8787/status Cluster Workers: 3 Cores: 12 Memory: 25.48 GB We submit tasks using the submit method data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] If you look at http://127.0.0.1:8787/status you will see the tasks beeing completed. Once competed, gather the data: data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () number of points for MonteCarlo value of pi error (%) 1195 9697405 3.141296 0.009454 1196 9772184 3.141058 0.017008 1197 9847540 3.141616 0.000739 1198 9923477 3.141009 0.018574 1199 10000000 3.141032 0.017833 There, we have completed a simple example on how to use dask to run multiple functions in parallel. Full source code: import numpy as np import pandas as pd from distributed import Client #function used to do parallel computing on def compute_pi_MonteCarlo ( Nb_Data ): \"\"\" computes the value of pi using the monte carlo method \"\"\" Radius = 1 Nb_Data = int ( round ( Nb_Data )) x = np . random . uniform ( - Radius , Radius , Nb_Data ) y = np . random . uniform ( - Radius , Radius , Nb_Data ) pi_mc = 4 * np . sum ( np . power ( x , 2 ) + np . power ( y , 2 ) < Radius ** 2 ) / Nb_Data err = 100 * np . abs ( pi_mc - np . pi ) / np . pi return [ Nb_Data , pi_mc , err ] #connect to the scheduler client = Client ( '127.0.0.1:8786' ) #submit tasks data = [ client . submit ( compute_pi_MonteCarlo , Nb_Data ) for Nb_Data in np . logspace ( 3 , 7 , num = 1200 , dtype = int )] #gather the results data = client . gather ( data ) df = pd . DataFrame ( data ) df . columns = [ \"number of points for MonteCarlo\" , \"value of pi\" , \"error (%)\" ] df . tail () A word on the environement variables On Windows, to make sure that you can run dask-scheduler and dask-worker from the command line, you have to add the location of the executable to your path. On linux, you can append the location of the dask-worker and scheduler to the path variable with the command export PATH = $PATH :/path/to/dask","tags":"Python","url":"redoules.github.io/python/dask_distributed_parallelism.html","loc":"redoules.github.io/python/dask_distributed_parallelism.html"},{"title":"Plotting data using log axis","text":"Plotting in log axis with matplotlib import matplotlib.pyplot as plt % matplotlib inline import numpy as np x = np . linspace ( 0.1 , 20 ) y = 20 * np . exp ( - x / 10.0 ) Plotting using the standard function then specifying the axis scale One of the easiest way to plot in a log plot is to specify the plot normally and then specify which axis is to be plotted with a log scale. This can be specified by the function set_xscale or set_yscale # Normal plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_xscale ( 'log' ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_yscale ( 'log' ) ax . grid () plt . show () # Log x axis plot fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . plot ( x , y ) ax . set_xscale ( 'log' ) ax . set_yscale ( 'log' ) ax . grid () plt . 
show () Plotting using the matplotlib defined function Matplotlib has the function : semilogx, semilogy and loglog that can help you avoid having to specify the axis scale. # Plot using semilogx fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogx ( x , y ) ax . grid () plt . show () # Plot using semilogy fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . semilogy ( x , y ) ax . grid () plt . show () # Plot using loglog fig = plt . figure () ax = fig . add_subplot ( 1 , 1 , 1 ) ax . loglog ( x , y ) ax . grid () plt . show ()","tags":"Python","url":"redoules.github.io/python/logplot.html","loc":"redoules.github.io/python/logplot.html"},{"title":"Downloading a static webpage with python","text":"If you are using python legacy (aka python 2) first of all, stop ! Furthermore, this method won't work in python legacy # Import modules from urllib.request import urlopen The webpage source code can be downloaded with the command urlopen url = \"http://example.com/\" #create a HTTP request in order to read the page page = urlopen ( url ) . read () The source code will be stored in the variable page as a string print ( page ) b' <!doctype html> \\n <html> \\n <head> \\n <title> Example Domain </title> \\n\\n <meta charset= \"utf-8\" /> \\n <meta http-equiv= \"Content-type\" content= \"text/html; charset=utf-8\" /> \\n <meta name= \"viewport\" content= \"width=device-width, initial-scale=1\" /> \\n <style type= \"text/css\" > \\n body {\\n background-color: #f0f0f2;\\n margin: 0;\\n padding: 0;\\n font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\\n \\n }\\n div {\\n width: 600px;\\n margin: 5em auto;\\n padding: 50px;\\n background-color: #fff;\\n border-radius: 1em;\\n }\\n a:link, a:visited {\\n color: #38488f;\\n text-decoration: none;\\n }\\n @media (max-width: 700px) {\\n body {\\n background-color: #fff;\\n }\\n div {\\n width: auto;\\n margin: 0 auto;\\n border-radius: 0;\\n padding: 1em;\\n }\\n }\\n </style> \\n </head> \\n\\n <body> \\n <div> \\n <h1> Example Domain </h1> \\n <p> This domain is established to be used for illustrative examples in documents. You may use this\\n domain in examples without prior coordination or asking for permission. </p> \\n <p><a href= \"http://www.iana.org/domains/example\" > More information... </a></p> \\n </div> \\n </body> \\n </html> \\n' Additionally, you can beautifulsoup in order to make it easier to work with html from bs4 import BeautifulSoup soup = BeautifulSoup ( page , 'lxml' ) soup . prettify () print ( soup ) <!DOCTYPE html> < html > < head > < title > Example Domain </ title > < meta charset = \"utf-8\" /> < meta content = \"text/html; charset=utf-8\" http-equiv = \"Content-type\" /> < meta content = \"width=device-width, initial-scale=1\" name = \"viewport\" /> < style type = \"text/css\" > body { background-color : #f0f0f2 ; margin : 0 ; padding : 0 ; font-family : \"Open Sans\" , \"Helvetica Neue\" , Helvetica , Arial , sans-serif ; } div { width : 600 px ; margin : 5 em auto ; padding : 50 px ; background-color : #fff ; border-radius : 1 em ; } a : link , a : visited { color : #38488f ; text-decoration : none ; } @ media ( max-width : 700px ) { body { background-color : #fff ; } div { width : auto ; margin : 0 auto ; border-radius : 0 ; padding : 1 em ; } } </ style > </ head > < body > < div > < h1 > Example Domain </ h1 > < p > This domain is established to be used for illustrative examples in documents. 
You may use this domain in examples without prior coordination or asking for permission. </ p > < p >< a href = \"http://www.iana.org/domains/example\" > More information... </ a ></ p > </ div > </ body > </ html >","tags":"Python","url":"redoules.github.io/python/download_page.html","loc":"redoules.github.io/python/download_page.html"},{"title":"Getting stock market data","text":"Start by importing the packages. We will need pandas and the pandas_datareader. # Import modules import pandas as pd from pandas_datareader import data Datareader allows you to import data from the internet. I have found that Quandl and robinhood works the best as a source for stockmarket data. Note that if you want an other type of data (e.g. GDP, inflation, etc.) other sources exist. #import stock from robinhood aapl_robinhood = data . DataReader ( 'AAPL' , 'robinhood' , '1980-01-01' ) aapl_robinhood . head () close_price high_price interpolated low_price open_price session volume symbol begins_at AAPL 2017-08-04 153.996200 154.990700 False 153.306900 153.681100 reg 20559852 2017-08-07 156.379100 156.487400 False 154.272000 154.655900 reg 21870321 2017-08-08 157.629700 159.352900 False 155.847400 156.172300 reg 36205896 2017-08-09 158.594700 158.801500 False 156.674500 156.822200 reg 26131530 2017-08-10 153.543100 158.169600 False 152.861000 158.070700 reg 40804273 #import stock from quandl aapl_quandl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl_quandl . head () Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0","tags":"Python","url":"redoules.github.io/python/stock_pandas.html","loc":"redoules.github.io/python/stock_pandas.html"},{"title":"Moving average with pandas","text":"# Import modules import pandas as pd from pandas_datareader import data , wb #import packages from pandas_datareader import data aapl = data . DataReader ( 'AAPL' , 'quandl' , '1980-01-01' ) aapl . head () Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume Date 2018-03-27 173.68 175.15 166.92 168.340 38962839.0 0.0 1.0 173.68 175.15 166.92 168.340 38962839.0 2018-03-26 168.07 173.10 166.44 172.770 36272617.0 0.0 1.0 168.07 173.10 166.44 172.770 36272617.0 2018-03-23 168.39 169.92 164.94 164.940 40248954.0 0.0 1.0 168.39 169.92 164.94 164.940 40248954.0 2018-03-22 170.00 172.68 168.60 168.845 41051076.0 0.0 1.0 170.00 172.68 168.60 168.845 41051076.0 2018-03-21 175.04 175.09 171.26 171.270 35247358.0 0.0 1.0 175.04 175.09 171.26 171.270 35247358.0 In order to computer the moving average, we will use the rolling function. #120 days moving average moving_averages = aapl [[ \"Open\" , \"High\" , \"Low\" , \"Close\" , \"Volume\" ]] . rolling ( window = 120 ) . mean () moving_averages . 
tail () Open High Low Close Volume Date 1980-12-18 28.457667 28.551917 28.385000 28.385000 139495.000000 1980-12-17 28.410750 28.502917 28.338083 28.338083 141772.500000 1980-12-16 28.362833 28.453917 28.289167 28.289167 141256.666667 1980-12-15 28.335750 28.426833 28.262083 28.262083 144321.666667 1980-12-12 28.310750 28.402833 28.238167 28.238167 159625.000000 % matplotlib inline import matplotlib.pyplot as plt plt . plot ( aapl . index , aapl . Open , label = 'Open price' ) plt . plot ( moving_averages . index , moving_averages . Open , label = \"120 MA Open price\" ) plt . legend () plt . show ()","tags":"Python","url":"redoules.github.io/python/Moving_average_pandas.html","loc":"redoules.github.io/python/Moving_average_pandas.html"},{"title":"Keywords to use with WHERE","text":"Keywords to use with WHERE #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' Assignment operator The assignment operator is =. % sql SELECT * FROM tutyfrutty WHERE color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 2 Apple red 52 7 Cranberry red 308 Comparison operators Comparison operation can be done in a SQL querry. They are the following : Equality : = Greater than : > greater than or equal to : >= less than : < less than or equal to : <= not equal to : <>, != not greater than : !> not less than : !< % sql SELECT * FROM tutyfrutty WHERE kcal = 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 % sql SELECT * FROM tutyfrutty WHERE kcal > 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal >= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28 % sql SELECT * FROM tutyfrutty WHERE kcal <> 47 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 Logical operators Logical operators test a condition and return a boolean. The logicial operators in SQL are : ALL : true if all the condtions are true AND : true is both conditions are true ANY : true if any one of the conditions are true BETWEEN : true if the operand in withing a range of values EXISTS : true if the subquery contains any rows IN : true if the condition is present in a row LIKE : true if a pattern is matched NOT : True if the operand is false, false otherwise OR : True is either condition is true SOME : true is any of the conditions is true % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" AND kcal < 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR kcal > 300 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE fruit LIKE 'l%' * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 % sql SELECT * FROM tutyfrutty WHERE NOT color = \"yellow\" * sqlite:///mydatabase.db Done. 
index fruit color kcal 1 Orange orange 47 2 Apple red 52 4 lime green 30 5 plum purple 28 7 Cranberry red 308 % sql SELECT * FROM tutyfrutty WHERE kcal BETWEEN 40 AND 100 * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 Bitwise operators Some bitwise operators exist in SQL. They will not be demonstrated here. They are the following : AND : & OR : | XOR : ^ NOT : ~","tags":"SQL","url":"redoules.github.io/sql/WHERE_SQL_keywords.html","loc":"redoules.github.io/sql/WHERE_SQL_keywords.html"},{"title":"Sorting results","text":"Sorting results in SQL Sorting results can be achieved by using a modifier command at the end of the SQL querry #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' The results can be sorted with the command ORDER BY SELECT column-list FROM table_name [WHERE condition] [ORDER BY column1, column2, .. columnN] [ASC | DESC] Let's show an example where we extract the fruits that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Ascending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal ASC * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 2 Apple red 52 0 Banana yellow 89 7 Cranberry red 308 descending sort % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" ORDER BY kcal DESC * sqlite:///mydatabase.db Done. index fruit color kcal 7 Cranberry red 308 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 Sort by multiple columns You can sort by more than one column. Just specify multiple columns in the ORDER BY keyword. In the example, we will sort alphabetically on the color column first and sort alphabetically on the fruit column % sql SELECT * FROM tutyfrutty ORDER BY color , fruit ASC * sqlite:///mydatabase.db Done. index fruit color kcal 4 lime green 30 1 Orange orange 47 5 plum purple 28 2 Apple red 52 7 Cranberry red 308 0 Banana yellow 89 3 lemon yellow 15","tags":"SQL","url":"redoules.github.io/sql/Sorting_results.html","loc":"redoules.github.io/sql/Sorting_results.html"},{"title":"Filter content of a TABLE","text":"Filter content of a TABLE in SQL In this example, we will display the content of a table but we will filter out the results. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' Filter content matching exactly a condition We want to extract all the entries in a dataframe that match a certain condition, in order to do so, we will use the following command : SELECT * FROM TABLE WHERE column=\"condition\" In our example, we will filter all the entries in the tutyfrutty table whose color is yellow % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 3 lemon yellow 15 Complex conditions You can build more complex conditions by using the keywords OR and AND In the following example, we will filter all entries that are either yellow or red % sql SELECT * FROM tutyfrutty WHERE color = \"yellow\" OR color = \"red\" * sqlite:///mydatabase.db Done. 
index fruit color kcal 0 Banana yellow 89 2 Apple red 52 3 lemon yellow 15 7 Cranberry red 308 Note : when combining multiple conditions with AND and OR, be careful to use parentesis where needed Conditions matching a pattern You can also use the LIKE keyword in order to find all entries that match a certain pattern. In our example, we want to find all fruits begining with a \"l\". In order to do so, we will use the LIKE keyword and the wildcard \"%\" meaning any string % sql SELECT * FROM tutyfrutty WHERE fruit LIKE \"l%\" * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 Numerical conditions When we are working with numerical data, we can use the GREATER THAN > and SMALLER THAN < operators % sql SELECT * FROM tutyfrutty WHERE kcal < 47 * sqlite:///mydatabase.db Done. index fruit color kcal 3 lemon yellow 15 4 lime green 30 5 plum purple 28 If we want the condition to be inclusive we can use the operator <= (alternatively >=) % sql SELECT * FROM tutyfrutty WHERE kcal <= 47 * sqlite:///mydatabase.db Done. index fruit color kcal 1 Orange orange 47 3 lemon yellow 15 4 lime green 30 5 plum purple 28","tags":"SQL","url":"redoules.github.io/sql/display_table_filter.html","loc":"redoules.github.io/sql/display_table_filter.html"},{"title":"Displaying the content of a TABLE","text":"Displaying the content of a TABLE in SQL In this very simple example we will see how to display the content of a table. Since we are working in the notebook, we will load the sql extension in order to manipulate the database. The database mydatabase.db is a SQLite database already created before the example. #load the extension % load_ext sql #connect to the database % sql sqlite : /// mydatabase . db 'Connected: @mydatabase.db' In order to extract all the values from a table, we will use the following command : SELECT * FROM TABLE In our example, we want to display the data contained in the table named tutyfrutty % sql SELECT * FROM tutyfrutty * sqlite:///mydatabase.db Done. index fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308","tags":"SQL","url":"redoules.github.io/sql/display_table.html","loc":"redoules.github.io/sql/display_table.html"},{"title":"Opening a file with python","text":"This short article show you how to open a file using python. We will use the with keyword in order to avoid having to close the file. There is no need to import anything in order to open a file. All the function related to file manipulation are part of the python standard library In order to open a file, we will use the function open. This function takes two arguments : the path of the file the mode you want to open the file The mode can be : 'r' : read 'w' : write 'a' : append (writes at the end of the file) 'b' : binary mode 'x' : exclusive creation 't' : text mode (by default) Note that if the file does not exit it will be created if you use the following options \"w\", \"a\", \"x\". If you try to open a non existing file in read mode 'r', a FileNotFoundError will be returned. It is possible to combine multiple options together. For instance, you can open a file in binary mode for writing using the 'wb' option. Python distinguishes between binary and text I/O. Files opened in binary mode return contents as bytes objects without any decoding. 
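As a minimal sketch (example.txt is just a placeholder for any existing file), reading in binary mode gives back raw bytes:

# 'rb' opens the file for reading in binary mode: no decoding is performed
with open('example.txt', 'rb') as f:
    raw = f.read()

print(type(raw))  # <class 'bytes'>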
In text mode , the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given. Writing to a file Let's first open (create) a text file a write a string to it. filepath = \". \\\\ myfile.txt\" with open ( filepath , 'w' ) as f : f . write ( \"Hello world !\" ) Reading a file we can now see how to read the content of a file. To do so, we will use the 'r' option with open ( filepath , \"r\" ) as f : content = f . read () print ( content ) Hello world ! A word on the with keyword In python the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has enter () and exit () methods).","tags":"Python","url":"redoules.github.io/python/Opening_file.html","loc":"redoules.github.io/python/Opening_file.html"},{"title":"Opening a SQLite database with python","text":"This short article show you how to connect to a SQLite database using python. We will use the with keyword in order to avoid having to close the database. In order to connect to the database, we will have to import sqlite3 import sqlite3 from sqlite3 import Error In python the with keyword is used when working with unmanaged resources (like file streams). The python documentation tells us that : The with statement clarifies code that previously would use try...finally blocks to ensure that clean-up code is executed. In this section, I'll discuss the statement as it will commonly be used. In the next section, I'll examine the implementation details and show how to write objects for use with this statement. The with statement is a control-flow structure whose basic structure is: with expression [ as variable ]: with - block The expression is evaluated, and it should result in an object that supports the context management protocol (that is, has enter () and exit () methods). db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"Connected to the database\" ) #your code here except Error as e : print ( e ) Connected to the database","tags":"Python","url":"redoules.github.io/python/Opening_SQLite_database.html","loc":"redoules.github.io/python/Opening_SQLite_database.html"},{"title":"Reading data from a sql database with pandas","text":"When manipulating you data using pandas, it is sometimes useful to pull data from a database. In this tutorial, we will see how to querry a dataframe from a sqlite table. Note than it would also work with any other sql database a long as you change the connxion to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error We want to store the table tutyfrutty in our dataframe. To do so, we will query all the elements present in the tutyfrutty TABLE with the command : SELECT * FROM tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df = pd . 
read_sql ( \"SELECT * FROM tutyfrutty\" , conn ) del df [ \"index\" ] #juste delete the index column that was stored in the table except Error as e : print ( e ) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 6 Cranberry red 308 7 Cranberry red 308","tags":"Python","url":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Reading_data_from_a_sql_database_with_pandas.html"},{"title":"Writing data to a sql database with pandas","text":"When manipulating you data using pandas, it is sometimes useful to store a dataframe. Pandas provides multiple ways to export dataframes. The most common consist in exporting to a csv, a pickle, to hdf or to excel. However, exporting to a sql database can prove very useful. Indeed, having a well structured database is a great for storing all the data related to your analysis in one place. In this tutorial, we will see how to store a dataframe in a new table of a sqlite dataframe. Note than it would also work with any other sql database a long as you change the connxion to the one that suits your needs. First let's import pandas and sqlite3 import pandas as pd import sqlite3 from sqlite3 import Error # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Now that the DataFrame has been created, let's push it to the sqlite database called mydatabase.db in a new table called tutyfrutty db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . to_sql ( \"tutyfrutty\" , conn ) except Error as e : print ( e ) except ValueError : print ( \"The TABLE tutyfrutty already exists, read below to understand how to handle this case\" ) Note that if the table tutyfrutty was already existing, the to_sql function will return a ValueError. This is where, the if_exists option comes into play. Let's look at the docstring of this function : \"\"\" if_exists : {'fail', 'replace', 'append'}, default 'fail' - fail: If table exists, do nothing. - replace: If table exists, drop it, recreate it, and insert data. - append: If table exists, insert data. Create if does not exist. \"\"\" Let's say, I want to update my dataframe with some new rows df . loc [ len ( df ) + 1 ] = [ 'Cranberry' , 'red' , 308 ] df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 7 Cranberry red 308 8 Cranberry red 308 I can now replace the table with the new values using the \"replace\" option db_file = \". \\\\ mydatabase.db\" try : with sqlite3 . connect ( db_file ) as conn : df . to_sql ( \"tutyfrutty\" , conn , if_exists = \"replace\" ) except Error as e : print ( e )","tags":"Python","url":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html","loc":"redoules.github.io/python/Writing_data_to_a_sql_database_with_pandas.html"},{"title":"Creating a sqlite database","text":"When you want to start with using databases SQlite is a great tool. It provides an easy onramp to learn and prototype you database with a SQL compatible database. 
First, let's import the libraries we need import sqlite3 from sqlite3 import Error SQLite doesn't need a database server; however, you have to start by creating an empty database file import os def check_for_db_file (): if os . path . exists ( \"mydatabase.db\" ): print ( \"the database is ready\" ) else : print ( \"no database found\" ) check_for_db_file () no database found Let's then create a function that will connect to a database, print the version of the sqlite3 module and then close the connection to the database. def create_database ( db_file ): \"\"\" create a database connection to a SQLite database \"\"\" try : with sqlite3 . connect ( db_file ) as conn : print ( \"database created with sqlite3 version {0} \" . format ( sqlite3 . version )) except Error as e : print ( e ) create_database ( \".\\mydatabase.db\" ) database created with sqlite3 version 2.6.0 check_for_db_file () the database is ready You're all set. From now on, you can open the database and write SQL queries into it.","tags":"Python","url":"redoules.github.io/python/Creating_a_sqlite_database.html","loc":"redoules.github.io/python/Creating_a_sqlite_database.html"},{"title":"Setting up the notebook for plotting with matplotlib","text":"Importing Matplotlib First we need to import pyplot, a collection of command style functions that make matplotlib work like MATLAB. Let's also use the magic command %matplotlib inline in order to display the figures in the notebook import matplotlib.pyplot as plt % matplotlib inline # this doubles image size, but we'll do it manually below # %config InlineBackend.figure_format = 'retina' The following parameters are recommended for matplotlib; they will make matplotlib output a better quality image # %load snippets/matplot_setup.py plt . rcParams [ 'savefig.dpi' ] = 300 plt . rcParams [ 'figure.dpi' ] = 163 plt . rcParams [ 'figure.autolayout' ] = False plt . rcParams [ 'figure.figsize' ] = 20 , 12 plt . rcParams [ 'axes.labelsize' ] = 18 plt . rcParams [ 'axes.titlesize' ] = 20 plt . rcParams [ 'font.size' ] = 16 plt . rcParams [ 'lines.linewidth' ] = 2.0 plt . rcParams [ 'lines.markersize' ] = 8 plt . rcParams [ 'legend.fontsize' ] = 14 plt . rcParams [ 'text.usetex' ] = False # True activates latex output in fonts! plt . rcParams [ 'font.family' ] = \"serif\" plt . rcParams [ 'font.serif' ] = \"cm\" plt . rcParams [ 'text.latex.preamble' ] = \" \\\\ usepackage {subdepth} , \\\\ usepackage {type1cm} \" You can change the second line in order to fit your display. 163 dpi corresponds to a Dell Ultra HD 4k P2715Q. You can check your screen's dpi count at http://dpi.lv/","tags":"Python","url":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html","loc":"redoules.github.io/python/Setting_up_the_notebook_for_plotting_with_matplotlib.html"},{"title":"Why using a blockchain is a bad idea for your business","text":"What does having a blockchain imply? storage costs : everyone maintaining the ledger needs to store every transaction bandwidth costs : everyone has to broadcast every transaction computational costs : every node has to validate the blockchain control : the creator does not control the blockchain, everyone collectively controls it development costs : developing on a blockchain is way harder than on a traditional database What to ask a business when they tell you that they are using a blockchain?
When a business tells you about their innovative technology leveraging the power of the blockchain, this should immediately spark some questions : What is the consensus algorithm? who is responsible for validating the consensus rules? what is the nature of the participation? is it open to access? is it open to innovation? is it a public ledger? is it transparent? does it improve accountability? is it cross-border? how is it validated?","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/blockchain_bad.html","loc":"redoules.github.io/cryptocurrencies/blockchain_bad.html"},{"title":"Synology NFS share","text":"Setting up a NFS share Log in to your DSM admin account, open the \"Control Panel\" and go to \"File Services\" Make sure NFS is enabled Back in the control panel, go to \"Shared Folder\" Select the folder you want to share and click \"Edit\" Go to the \"NFS Permissions\" tab and click \"Create\", add the IP of the device you want to mount the mapped drive on. Make sure you copy the \"Mount path\"","tags":"Linux","url":"redoules.github.io/linux/share_nfs_share.html","loc":"redoules.github.io/linux/share_nfs_share.html"},{"title":"Mount a NFS share using fstab","text":"Mount nfs using fstab The fstab file, generally located at /etc/fstab, lists the different partitions and where to mount them on the filesystem. You can edit this file as root by using the following command sudo nano /etc/fstab In the following example, we want to mount a NFS v3 share from : * server : 192.168.1.2 * mountpoint (on the server) : /volumeUSB2/usbshare * mountlocation (on the client) : /mnt we specify 192.168.1.2:/volumeUSB2/usbshare /mnt nfs nfsvers = 3 ,users 0 0 the client will then automatically mount the share on /mnt at startup. Related You can reload the fstab file using this method : https://redoules.github.io/linux/Reloading_fstab.html You can create a NFS share on a Synology using this method : https://redoules.github.io/linux/share_nfs_share.html","tags":"Linux","url":"redoules.github.io/linux/mount_nfs_share_fstab.html","loc":"redoules.github.io/linux/mount_nfs_share_fstab.html"},{"title":"Installing bitcoind on raspberry pi","text":"Installing bitcoind on linux Running a full bitcoin node helps the bitcoin network to accept, validate and relay transactions. If you want to volunteer some spare computing and bandwidth resources to run a full node and allow Bitcoin to continue to grow, you can grab an inexpensive and power-efficient raspberry pi and turn it into a full node. There are plenty of tutorials on the Internet explaining how to install a bitcoin full node; this tutorial won't go over setting up a raspberry pi and using ssh. In order to store the full blockchain we will mount a network drive and tell bitcoind to use this mapped drive as the data directory. Download the bitcoin client Go to https://bitcoin.org/en/download Copy the URL for the ARM 32 bit version and download it onto your raspberry pi. wget https://bitcoin.org/bin/bitcoin-core-0.15.1/bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz Locate the downloaded file and extract it using the argument xzf tar xzf bitcoin-0.15.1-arm-linux-gnueabihf.tar.gz A new directory bitcoin-0.15.1 will be created; it contains the files we need to install the software Install the bitcoin client We will install the content by copying the binaries located in the bin folder into /usr/local/bin by using the install command.
You must use sudo because it will write data to a system directory sudo install -m 0755 -o root -g root -t /usr/local/bin bitcoin-0.15.1/bin/* Launch the bitcoin core client by running bitcoind -daemon Configuration of the node Start your node at boot Starting your node automatically at boot time is a good idea because it doesn't require a manual action from the user. The simplest way to achieve this is to create a cronjob. Run the following command crontab -e Select the text editor of your choice, then add the following line at the end of the file @reboot bitcoind -daemon Save the file and exit; the updated crontab file will be installed for you. Full Node If you can afford to download and store the whole blockchain, you can run a full node. At the time of writing, the blockchain is 150 GB ( https://blockchain.info/fr/charts/blocks-size ). Three ways to store this are : * use a microSD card with 256 GB or more * add a thumbdrive or an external drive to your raspberry pi * mount a network drive from a NAS If you have purchased a big SD card then you can leave the default location for the blockchain data (~/.bitcoin/). Otherwise, you will have to change the datadir location to where your drive is mounted (in my case I have mounted it to /mnt) In order to configure your bitcoin client, edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Specify a non-default location to store blockchain and other data. datadir=/mnt # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit. maxuploadtarget=5000 Check https://jlopp.github.io/bitcoin-core-config-generator ; it is a handy site to edit the bitcoin.conf file Pruning node If you don't want to store the entire blockchain, you can run a pruning node, which reduces storage requirements by enabling pruning (deleting) of old blocks. Let's say you want to allocate at most 5 GB to the blockchain, then specify prune=5000 in your bitcoin.conf file. Edit/create the file bitcoin.conf located in ~/.bitcoin/ nano ~/.bitcoin/bitcoin.conf copy the following text # From redoules.github.io # This config should be placed in following path: # ~/.bitcoin/bitcoin.conf # [core] # Set database cache size in megabytes; machines sync faster with a larger cache. Recommend setting as high as possible based upon mach$ dbcache=100 # Keep at most <n> unconnectable transactions in memory. maxorphantx=10 # Keep the transaction memory pool below <n> megabytes. maxmempool=50 # Reduce storage requirements by only storing most recent N MiB of blocks. This mode is incompatible with -txindex and -rescan. WARNING: Reverting this setting requires re-downloading the entire blockchain. (default: 0 = disable pruning blocks, 1 = allow manual pruning via RPC, greater than 550 = automatically prune blocks to stay under target size in MiB). prune=5000 # [network] # Maintain at most N connections to peers. maxconnections=40 # Tries to keep outbound traffic under the given target (in MiB per 24h), 0 = no limit.
maxuploadtarget=5000 Checking if your node is public One of the best ways to help the bitcoin network is to allow your node to be visible and to propagate blocks to other nodes. The bitcoin protocol uses port 8333; other clients should be able to share information with your client. Run ifconfig and check if you have an IPv6 address (look for adr inet6:) IPV6 Get the global IPv6 address of your raspberry pi Link encap:Ethernet HWaddr xx:xx:xx:xx:xx:xx inet adr:192.168.1.x Bcast:192.168.1.255 Masque:255.255.255.0 adr inet6: xxxx::xxxx:xxxx:xxxx:xxxx/64 Scope:Lien adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:42681744 errors:0 dropped:0 overruns:0 frame:0 TX packets:38447218 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 lg file transmission:1000 RX bytes:3044414780 (2.8 GiB) TX bytes:2599878680 (2.4 GiB) It is located between adr inet6: and Scope:Global adr inet6: xxxx:xxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx/64 Scope:Global Copy this address and paste it into the search field on https://bitnodes.earn.com/ If your node is visible, it will appear on the website IPV4 If you don't have an IPv6 address, you will have to open port 8333 on your router and redirect it to the internal IP of your raspberry pi. It is not detailed here because the configuration depends on your router.","tags":"Cryptocurrencies","url":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html","loc":"redoules.github.io/cryptocurrencies/Installing_bitcoind_on_raspberry_pi.html"},{"title":"Reloading .bashrc","text":"Reload .bashrc The .bashrc file, located at ~/.bashrc, allows a user to personalize their bash shell. If you edit this file, the changes won't be loaded without logging out and back in. However, you can use the following command to do it source ~/.bashrc","tags":"Linux","url":"redoules.github.io/linux/Reloading_.bashrc.html","loc":"redoules.github.io/linux/Reloading_.bashrc.html"},{"title":"Reloading fstab","text":"Reload fstab The fstab file, generally located at /etc/fstab, lists the different partitions and where to mount them on the filesystem. If you edit this file, the changes won't be automounted. You either have to reboot your system or use the following command as root mount -a","tags":"Linux","url":"redoules.github.io/linux/Reloading_fstab.html","loc":"redoules.github.io/linux/Reloading_fstab.html"},{"title":"Updating all python package with anaconda","text":"Updating anaconda packages All packages managed by conda can be updated with the following command : conda update --all Updating other packages with pip For the other packages, the pip package manager can be used. Unfortunately, pip doesn't have the same update-all functionality. import pip from subprocess import call for dist in pip . get_installed_distributions (): print ( \"updating {0} \" . format ( dist )) call ( \"pip install --upgrade \" + dist . project_name , shell = True )","tags":"Python","url":"redoules.github.io/python/updating_all_python_package_with_anaconda.html","loc":"redoules.github.io/python/updating_all_python_package_with_anaconda.html"},{"title":"Saving a matplotlib figure with a high resolution","text":"Creating a matplotlib figure #Importing matplotlib % matplotlib inline import matplotlib.pyplot as plt import numpy as np Drawing a figure # Fixing random state for reproducibility np . random . seed ( 19680801 ) mu , sigma = 100 , 15 x = mu + sigma * np . random . randn ( 10000 ) # the histogram of the data n , bins , patches = plt .
hist ( x , 50 , normed = 1 , facecolor = 'g' , alpha = 0.75 ) plt . xlabel ( 'Smarts' ) plt . ylabel ( 'Probability' ) plt . title ( 'Histogram of IQ' ) plt . text ( 60 , . 025 , r '$\mu=100,\ \sigma=15$' ) plt . axis ([ 40 , 160 , 0 , 0.03 ]) plt . grid ( True ) plt . show () Saving the figure Normally, one would use the following code plt . savefig ( 'filename.png' ) <matplotlib.figure.Figure at 0x2e45e92f400> The figure is then exported to the file \"filename.png\" with a standard resolution. In addition, you can set the dpi argument to some scalar value, for example: plt . savefig ( 'filename_hi_dpi.png' , dpi = 300 ) <matplotlib.figure.Figure at 0x2e462164898>","tags":"Python","url":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html","loc":"redoules.github.io/python/Saving_a_matplotlib_figure_with_a_high_resolution.html"},{"title":"Iterate over a DataFrame","text":"Create a sample dataframe # Import modules import pandas as pd # Example dataframe raw_data = { 'fruit' : [ 'Banana' , 'Orange' , 'Apple' , 'lemon' , \"lime\" , \"plum\" ], 'color' : [ 'yellow' , 'orange' , 'red' , 'yellow' , \"green\" , \"purple\" ], 'kcal' : [ 89 , 47 , 52 , 15 , 30 , 28 ] } df = pd . DataFrame ( raw_data , columns = [ 'fruit' , 'color' , 'kcal' ]) df fruit color kcal 0 Banana yellow 89 1 Orange orange 47 2 Apple red 52 3 lemon yellow 15 4 lime green 30 5 plum purple 28 Using the iterrows method Pandas DataFrames can return a generator with the iterrows method. It can then be used to loop over the rows of the DataFrame for index , row in df . iterrows (): print ( \"At line {0} there is a {1} which is {2} and contains {3} kcal\" . format ( index , row [ \"fruit\" ], row [ \"color\" ], row [ \"kcal\" ])) At line 0 there is a Banana which is yellow and contains 89 kcal At line 1 there is a Orange which is orange and contains 47 kcal At line 2 there is a Apple which is red and contains 52 kcal At line 3 there is a lemon which is yellow and contains 15 kcal At line 4 there is a lime which is green and contains 30 kcal At line 5 there is a plum which is purple and contains 28 kcal","tags":"Python","url":"redoules.github.io/python/Iterating_over_a_dataframe.html","loc":"redoules.github.io/python/Iterating_over_a_dataframe.html"}]}
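As a complement to the iterrows example above, here is a minimal sketch (not from the original note) using itertuples, which yields namedtuples and is usually faster than iterrows; the sample data simply mirrors the fruit dataframe built above.

import pandas as pd

# same illustrative data as the sample dataframe above
raw_data = {
    'fruit': ['Banana', 'Orange', 'Apple', 'lemon', 'lime', 'plum'],
    'color': ['yellow', 'orange', 'red', 'yellow', 'green', 'purple'],
    'kcal': [89, 47, 52, 15, 30, 28],
}
df = pd.DataFrame(raw_data, columns=['fruit', 'color', 'kcal'])

for row in df.itertuples():
    # row.Index holds the index label; the column values are attributes
    print("At line {0} there is a {1} which is {2} and contains {3} kcal".format(
        row.Index, row.fruit, row.color, row.kcal))

itertuples avoids building a Series for every row the way iterrows does, which is why it tends to be noticeably faster on larger dataframes.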